Abstract. In this paper, we present the following results: (1) We propose
a new dynamic compressed index of O(w) space that supports searching for
a pattern P in the current text in O(|P| f(M, w) + log w log |P| log* M (log N +
log |P| log* M) + occ log N) time and insertion/deletion of a substring of
length y in O((y + log N log* M) log w log N log* M) time, where N is
the length of the current text, M is the maximum length of the dynamic
text, z is the size of the Lempel-Ziv77 (LZ77) factorization of the current
text, f(a, b) = O(min{(log log a log log b) / (log log log a),
√(log b / log log b)}), and w = O(z log N log* M). (2) We propose a new
space-efficient LZ77 factorization algorithm for a given text of length N,
which runs in O(N f(N, w') + z log w' log^3 N (log* N)^2) time with O(w')
working space, where w' = O(z log N log* N). (3) We propose a data structure
of O(w) space which supports longest common extension (LCE) queries on the
text in O(log N + log ℓ log* N) time, where ℓ is the output LCE length. On
top of the above contributions, we show several applications of our data
structures which improve previous best known results on grammar-compressed
string processing.


1 Introduction 

1.1 Dynamic compressed index 

In this paper, we consider the dynamic compressed text indexing problem of
maintaining a compressed index for a text string that can be modified. Although
there exist several dynamic non-compressed text indexes (see e.g. [2713] for
recent work), there has been little work on the compressed variants. Hon et al. [13]
proposed the first dynamic compressed index of O((1/ε)(N H_0 + N)) bits of space
which supports searching of P in O(|P| log^2 N (log^ε N + log |Σ|) + occ log^(1+ε) N)
time and insertion/deletion of a substring of length y in O((y + √N) log^(2+ε) N)
amortized time, where 0 < ε ≤ 1 and H_0 ≤ log |Σ| denotes the zeroth order
empirical entropy of the text of length N. Salson et al. [?] also proposed a
dynamic compressed index, called the dynamic FM-Index. Although their approach
works well in practice, updates require O(N log N) time in the worst case. To
our knowledge, these are the only existing dynamic compressed indexes to date.
In this paper, we propose a new dynamic compressed index, as follows:

Theorem 1. Let M be the maximum length of the dynamic text to index, N
the length of the current text T, and z the number of factors in the Lempel-Ziv
77 factorization of T without self-references. Then, there exists a dynamic
index of O(w) space which supports searching of a pattern P in O(|P| f_A +
log w log |P| log* M (log N + log |P| log* M) + occ log N) time and insertion/deletion
of a substring of length y in amortized O((y + log N log* M) log w log N log* M)
time, where w = O(z log N log* M) and f_A = O(min{(log log M log log w) /
(log log log M), √(log w / log log w)}).

Since z ≥ log N, log w = max{log z, log(log* M)}. Hence, our index is able to
find pattern occurrences faster than the index of Hon et al. when the |P| term is
dominating in the pattern search times. Also, our index allows faster substring
insertion/deletion on the text when the √N term is dominating.


Related work. Our dynamic compressed index uses Mehlhorn et al.’s locally
consistent parsing and signature encodings of strings [22], originally proposed
for efficient equality testing of dynamic strings. Alstrup et al. [3] showed how
to improve the construction time of Mehlhorn et al.’s data structure (details
can be found in the technical report [?]). Our data structure uses Alstrup et
al.’s fast string concatenation/split algorithms and linear-time computation of
locally consistent parsing, but has little else in common with their work. In
particular, Alstrup et al.’s dynamic pattern matching algorithm [312] requires
maintaining specific locations called anchors over the parse trees of the
signature encodings, but our index does not use anchors.

Our index is closely related to the ESP-indices [30,31], but there are two
significant differences between ours and ESP-indices. The first difference is that
the ESP-index [30] is static and its online variant [31] allows only for appending
new characters to the end of the text, while our index is fully dynamic, allowing for
insertion and deletion of arbitrary substrings at arbitrary positions. The second
difference is that the pattern search time of the ESP-index is proportional to
the number occ_c of occurrences of the so-called “core” of a query pattern P,
which corresponds to a maximal subtree of the ESP derivation tree of a query
pattern P. If occ is the number of occurrences of P in the text, then it always
holds that occ_c ≥ occ, and in general occ_c cannot be upper bounded by any
function of occ. In contrast, as can be seen in Theorem 1, the pattern search
time of our index is proportional to the number occ of occurrences of a query
pattern P. This became possible due to our discovery of a new property of the
signature encoding [2] (stated in Lemma 12). In relation to our problem, there
exists the library management problem of maintaining a text collection (a set
of text strings) allowing for insertion/deletion of texts (see [23] for recent work).
While in our problem a single text is edited by insertion/deletion of substrings,
in the library management problem a text can be inserted to or deleted from
the collection. Hence, algorithms for the library management problem cannot be
directly applied to our problem.

1.2 Applications and extensions 

Computing LZ77 factorization in compressed space. As an application
of our dynamic compressed index, we present a new LZ77 factorization algorithm
for a string T of length N, running in O(N f_A + z log w log^3 N (log* N)^2)
time and O(w) working space, where f_A = O(min{(log log N log log w) /
(log log log N), √(log w / log log w)}).

Goto et al. [?] showed how, given the grammar-like representation for string
T generated by the LCA algorithm [28], to compute the LZ77 factorization of
T in O(z log^2 m log^3 N + m log m log^3 N) time and O(m log^2 m) space, where
m is the size of the given representation. Sakamoto et al. [28] claimed that
m = O(z log N log* N); however, it seems that in this bound they do not consider
the production rules to represent maximal runs of non-terminals in the
derivation tree. The bound we were able to obtain with the best of our knowledge
and understanding is m = O(z log^2 N log* N), and hence our algorithm
seems to use less space than the algorithm of Goto et al. Recently, Fischer
et al. [?] showed a Monte Carlo randomized algorithm to compute an approximation
of the LZ77 factorization with at most 2z factors in O(N log N) time,
and another approximation with at most (1 + ε)z factors in O(N log^2 N) time for
any constant ε > 0, using O(z) space each. Another line of research is a recent
result by Policriti and Prezza [25] which uses N H_0 + o(N log |Σ|) + O(|Σ| log N)
bits of space and computes the LZ77 factorization in O(N log N) time.


Longest common extension queries in compressed space. Furthermore,
we consider the longest common extension (LCE) problems on: an uncompressed
string T of length N; a grammar-compressed string T represented by a straight-line
program (SLP) of size n; or an LZ77-compressed string T with z factors. The
best known deterministic LCE data structure on SLPs is due to I et al. [?], which
supports LCE queries in O(h log N) time each, occupies O(n^2) space, and can
be built in O(h n^2) time, where h is the height of the derivation tree of a given
SLP. Bille et al. [5] showed a Monte Carlo randomized data structure built on
a given SLP of size n which supports LCE queries in O(log N log ℓ) time each,
where ℓ is the output of the LCE query and N is the length of the uncompressed
text. Their data structure requires only O(n) space, but requires O(N) time to
construct. Very recently, Bille et al. [5] showed a faster Monte Carlo randomized
data structure of O(n) space which supports LCE queries in O(log N + log^2 ℓ)
time each. The preprocessing time of this new data structure is not given in [?].

In this paper, we present a new, deterministic LCE data structure using
compressed space, namely O(w) space, supporting LCE queries in O(log N +
log ℓ log* N) time each. We show how to construct this data structure in O(N log w)
time given an uncompressed string of length N, O(n log log n log N log* N) time
given an SLP of size n, and O(z log w log N log* N) time given the LZ77 factorization
of size z. We remark that our new LCE data structure allows for the fastest
deterministic LCE queries on SLPs, and even permits faster LCE queries than
the randomized data structure of Bille et al. [5] when log* N = o(log ℓ), which in
many cases is true.

All proofs omitted due to lack of space can be found in the appendices. 

2 Preliminaries 

2.1 Strings 

Let Σ be an ordered alphabet and $ be the lexicographically largest character in
Σ. An element of Σ* is called a string. For string w = xyz, x is called a prefix,
y is called a substring, and z is called a suffix of w, respectively. The length of
string w is denoted by |w|. The empty string ε is a string of length 0, that is,
|ε| = 0. Let Σ+ = Σ* − {ε}. For any 1 ≤ i ≤ |w|, w[i] denotes the i-th character
of w. For any 1 ≤ i ≤ j ≤ |w|, w[i..j] denotes the substring of w that begins
at position i and ends at position j. Let w[i..] = w[i..|w|] and w[..i] = w[1..i]
for any 1 ≤ i ≤ |w|. For any string w, let w^R denote the reversed string of w,
that is, w^R = w[|w|] ··· w[2]w[1]. For any strings w and u, let LCP(w, u) (resp.
LCS(w, u)) denote the length of the longest common prefix (resp. suffix) of w
and u. Given two strings s_1, s_2 and two integers i, j, let LCE(s_1, s_2, i, j) denote
a query which returns LCP(s_1[i..|s_1|], s_2[j..|s_2|]).

For any strings p and s, let Occ(p, s) denote all occurrence positions of p in
s, namely, Occ(p, s) = {i | p = s[i..i + |p| − 1], 1 ≤ i ≤ |s| − |p| + 1}.
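The string primitives above can be made concrete with a small sketch. The following is a naive, uncompressed illustration only (it is not the data structure of this paper); the function names `lcp`, `lcs`, `lce` and `occ` are chosen here for illustration, and positions are 1-based as in the text.

```python
def lcp(w: str, u: str) -> int:
    """Length of the longest common prefix of w and u."""
    n = 0
    while n < len(w) and n < len(u) and w[n] == u[n]:
        n += 1
    return n


def lcs(w: str, u: str) -> int:
    """Length of the longest common suffix of w and u."""
    return lcp(w[::-1], u[::-1])


def lce(s1: str, s2: str, i: int, j: int) -> int:
    """LCE(s1, s2, i, j) = LCP(s1[i..|s1|], s2[j..|s2|]) with 1-based i, j."""
    return lcp(s1[i - 1:], s2[j - 1:])


def occ(p: str, s: str) -> set:
    """All 1-based positions i with p = s[i..i + |p| - 1]."""
    return {i for i in range(1, len(s) - len(p) + 2)
            if s[i - 1:i - 1 + len(p)] == p}
```

Each call scans the strings character by character, so a query costs time linear in its answer (or in |s||p| for `occ`); the point of the paper is to answer such queries within compressed space.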

In this paper, we deal with a dynamic text, namely, we allow for
insertion/deletion of a substring to/from an arbitrary position of the text. Let M be
the maximum length of the dynamic text to index. Our model of computation
is the unit-cost word RAM with machine word size of log_2 M bits, and space
complexities will be evaluated by the number of machine words. Bit-oriented
evaluation of space complexities can be obtained with a log_2 M multiplicative
factor.

Lempel-Ziv 77 factorization. We will use the Lempel-Ziv 77 factorization [32] 
of a string to bound the running time and the size of our data structure on the 
string. It is a greedy factorization which scans the string from left to right, and 
recursively takes as a factor the longest prefix of the remaining suffix with a 
previous occurrence. Formally, it is defined as follows. 

Definition 1 (Lempel-Ziv77 Factorization [32]). The Lempel-Ziv77 (LZ77)
factorization of a string s without self-references is a sequence f_1, ..., f_z of
non-empty substrings of s such that s = f_1 ··· f_z, f_1 = s[1], and for 1 < i ≤ z, if the
character s[|f_1 ··· f_{i−1}| + 1] does not occur in s[..|f_1 ··· f_{i−1}|], then
f_i = s[|f_1 ··· f_{i−1}| + 1], otherwise f_i is the longest prefix of f_i ··· f_z
which occurs in f_1 ··· f_{i−1}.

The size of the LZ77 factorization f_1, ..., f_z of string s is the number z of factors
in the factorization.
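As a rough illustration of Definition 1, the following sketch computes the factorization without self-references by brute force. It is a quadratic-or-worse reference implementation in uncompressed space, not the algorithm proposed in this paper, and the function name is hypothetical.

```python
def lz77_no_self_ref(s: str) -> list:
    """Greedy LZ77 factorization where each factor must occur entirely
    inside the already-factorized prefix f_1 ... f_{i-1}."""
    factors = []
    pos = 0  # 0-based start of the next factor
    while pos < len(s):
        prefix = s[:pos]
        length = 0
        # Extend the candidate while it still occurs inside the prefix.
        while pos + length < len(s) and s[pos:pos + length + 1] in prefix:
            length += 1
        # A fresh character forms a factor of length 1 by itself.
        length = max(length, 1)
        factors.append(s[pos:pos + length])
        pos += length
    return factors
```

For instance, `lz77_no_self_ref("abababab")` yields the factors a, b, ab, abab, so z = 4 for that string.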

A variant of LZ77 factorization which allows for self-overlapping reference to
a previous occurrence is formally defined as follows.

Definition 2 (Lempel-Ziv77 Factorization with self-reference [32]). The
Lempel-Ziv77 (LZ77) factorization of a string s with self-references is a sequence
f_1, ..., f_k of non-empty substrings of s such that s = f_1 ··· f_k, f_1 = s[1], and for
1 < i ≤ k, if the character s[|f_1 ··· f_{i−1}| + 1] does not occur in
s[..|f_1 ··· f_{i−1}|], then f_i = s[|f_1 ··· f_{i−1}| + 1], otherwise f_i is the
longest prefix of f_i ··· f_k which occurs at some position p, where
1 ≤ p ≤ |f_1 ··· f_{i−1}|.

We will show that using our data structure, the LZ77 with self-reference can be
computed efficiently in compressed space.
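The self-referencing variant differs only in that the previous occurrence must start inside the factorized prefix but may run past its end. A brute-force sketch under that reading of Definition 2 follows (again uncompressed and purely illustrative; the function name is hypothetical).

```python
def lz77_self_ref(s: str) -> list:
    """Greedy LZ77 factorization where the source occurrence only has to
    START before the current position, so it may overlap the new factor."""
    factors = []
    pos, n = 0, len(s)
    while pos < n:
        best = 0
        for p in range(pos):  # candidate source start, strictly before pos
            length = 0
            while pos + length < n and s[p + length] == s[pos + length]:
                length += 1
            best = max(best, length)
        # A fresh character forms a factor of length 1 by itself.
        length = max(best, 1)
        factors.append(s[pos:pos + length])
        pos += length
    return factors
```

On runs and periodic strings the difference is visible: `"aaaa"` is factorized as a, aaa here, whereas the variant without self-references needs three factors a, a, aa.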


Locally consistent parsing. Let p be a string of length n over an integer
alphabet of size W where any adjacent elements are different, i.e., p[i] ≠ p[i +
1] for all 1 ≤ i < n. A locally consistent parsing of p is a parsing (or
factorization) q_1, ..., q_j of p such that p = q_1 ··· q_j, 2 ≤ |q_h| ≤ 4 for any
1 ≤ h ≤ j, and the boundary between q_{h−1} and q_h is “determined” by
p[|q_1 ··· q_{h−1}| + 1 − Δ_L..|q_1 ··· q_{h−1}| + 1 + Δ_R], where Δ_L = log* W + 6
and Δ_R = 4. Clearly, j ≤ n/2. By “determined” above, we mean that if a position i
of an integer string p and a position k of another integer string s share the same
left context of length at least Δ_L and the same right context of length at least Δ_R, then
there is a boundary of the locally consistent parsing of p between the positions
i − 1 and i iff there is a boundary of the locally consistent parsing of s between
the positions k − 1 and k. A formal definition of locally consistent parsing, and
its linear-time computation algorithm, is explained in the following lemma.

Lemma 1 (Locally consistent parsing [22,2]). Let W be a non-negative
integer and let p be an integer sequence of length n, called a W-colored sequence,
where p[i] ≠ p[i + 1] for any 1 ≤ i < n and 0 ≤ p[j] ≤ W for any 1 ≤ j ≤ n. For
every W there exists a function f : [−1..W]^(log* W + 11) → {0, 1} such that for every
W-colored sequence p, the bit sequence d defined by d[i] = f(p̃[i − Δ_L], ..., p̃[i +
Δ_R]) for 1 ≤ i ≤ n, satisfies:

— d[1] = 1,

— d[i] + d[i + 1] ≤ 1 for 1 ≤ i < n, and

— d[i] + d[i + 1] + d[i + 2] + d[i + 3] ≥ 1 for any 1 ≤ i < n − 3,

where Δ_L = log* W + 6, Δ_R = 4, and p̃[j] = p[j] for all 1 ≤ j ≤ n, p̃[j] = −1
otherwise. Furthermore, d can be computed in O(n) time using a precomputed
table of size o(log W). Also, we can compute this table in o(log W) time.

Proof. Here we give only an intuitive description of a proof of Lemma 1. More
detailed proofs can be found in [22] and [2].

Mehlhorn et al. [22] showed that there exists a function f' which returns
a (log W)-colored sequence p' for a given W-colored sequence p in O(|p|) time,
where p'[i] is determined only by p[i − 1] and p[i] for 1 ≤ i ≤ |p|. Let p^(k) denote
the output after applying f' to p k times. They also showed that there
exists a function f'' which returns a bit sequence d satisfying the conditions of
Lemma 1 for a 6-colored sequence p in O(|p|) time, where d[i] is determined only
by p[i − 3..i + 3] for 1 ≤ i ≤ |p|. Hence we can compute d for a W-colored sequence
p in O(|p| log* W) time by applying f'' to p^(log* W + 2) after computing p^(log* W + 2).
Furthermore, Alstrup et al. [5] showed that d can be computed in O(|p|) time
using a precomputed table of size o(log W). The idea is that p^(3) is a log log log W-colored
sequence and the number of all combinations of a log log log W-colored
sequence of length log* W + 11 is 2^((log* W + 11) log log log W) = o(log W). Hence we
can compute d for a W-colored sequence in linear time using a precomputed
table of size o(log W). □
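The three conditions that Lemma 1 imposes on the bit sequence d are easy to check mechanically. The sketch below only validates a given d against those conditions; it does not compute d (that requires the function f of the lemma), and the function name is made up for illustration.

```python
def satisfies_lemma1_conditions(d: list) -> bool:
    """Check the three conditions of Lemma 1 on a 0/1 list d.
    The list is 0-based here; the lemma states the conditions 1-based."""
    n = len(d)
    # d[1] = 1: the first position always starts a block.
    if n == 0 or d[0] != 1:
        return False
    # d[i] + d[i+1] <= 1 for 1 <= i < n: no two adjacent block starts.
    if any(d[i] + d[i + 1] > 1 for i in range(n - 1)):
        return False
    # d[i] + ... + d[i+3] >= 1 for 1 <= i < n - 3: every window of four
    # positions in that range contains a block start.
    if any(sum(d[i:i + 4]) < 1 for i in range(n - 4)):
        return False
    return True
```

Informally, the second condition keeps blocks from having length 1 in the interior, and the third keeps them from growing beyond length 4.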

Given a bit sequence d of Lemma 1, let Eblock_d(p) be the function that
decomposes an integer sequence p into a sequence q_1, ..., q_j of substrings called
blocks of p, such that p = q_1 ··· q_j and q_i is in the decomposition iff
d[|q_1 ··· q_{i−1}| + 1] = 1 for any 1 ≤ i ≤ j. We omit d and write Eblock(p) when
it is clear from the context, and we use implicitly the bit sequence created by
Lemma 1 as d. Let |Eblock(p)| = j and let Eblock(p)[i] = q_i. For a string s, let
Epow(s) be the function which groups each maximal run of same characters a as a^r,
where r is the length of the run. Epow(s) can be computed in O(|s|) time. Let |Epow(s)|
denote the number of maximal runs of same characters in s and let Epow(s)[i]
denote the i-th maximal run in s.


Example 1 (Eblock_d(p) and Epow(s)). Let log* W = 2, and then Δ_L = 8, Δ_R = 4.
If p = 1,2,3,2,5,7,6,4,3,4,3,4,1,2,3,4,5 and d = 1,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,
then Eblock_d(p) = (1,2,3), (2,5), (7,6,4), (3,4,3,4), (1,2), (3,4,5), |Eblock_d(p)| =
6 and Eblock_d(p)[2] = (2,5). For string s = aabbbbbabb, Epow(s) = a^2 b^5 a^1 b^2 and
|Epow(s)| = 4 and Epow(s)[2] = b^5.
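Example 1 can be reproduced with a few lines of code. This is a minimal sketch assuming the bit sequence d is already given (computing d is the job of Lemma 1); the names `eblock` and `epow` mirror the functions in the text, and runs are represented as hypothetical (symbol, length) pairs.

```python
def eblock(p: list, d: list) -> list:
    """Split p into blocks; a new block starts wherever d has a 1."""
    assert len(p) == len(d) and (not d or d[0] == 1)
    blocks = []
    for value, bit in zip(p, d):
        if bit == 1:
            blocks.append([])
        blocks[-1].append(value)
    return [tuple(block) for block in blocks]


def epow(s) -> list:
    """Group each maximal run of equal symbols a as the pair (a, r)."""
    runs = []
    for symbol in s:
        if runs and runs[-1][0] == symbol:
            runs[-1] = (symbol, runs[-1][1] + 1)
        else:
            runs.append((symbol, 1))
    return runs


p = [1, 2, 3, 2, 5, 7, 6, 4, 3, 4, 3, 4, 1, 2, 3, 4, 5]
d = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0]
blocks = eblock(p, d)   # six blocks, the second one is (2, 5)
runs = epow("aabbbbbabb")  # four runs, the second one is ("b", 5)
```

Both functions make a single left-to-right pass, matching the linear running time stated for Epow.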


2.2 Context free grammars as compressed representation of strings 

Admissible context free grammars. An admissible context free grammar
(ACFG) [18] is a CFG which generates only a single string. More formally, an
ACFG that generates a single string T is a quadruple G = (Σ, V, D, S), such
that

— Σ is an ordered alphabet of terminal characters,

— V = {e_1, ..., e_k} is a set of positive integers with e_1 < ··· < e_k, called
variables,

— D = {e_i → xexpr_i}_{i=1}^{k} is a set of deterministic productions (or assignments),
i.e., for each variable e ∈ V there is exactly one production in D whose
lefthand side is e,

— each e_i ∈ V \ {S} appears at least once in the righthand side of some
production e_j → xexpr_j with e_i < e_j, and

— S ∈ V is the start symbol which derives the string T.

Sometimes we handle a variable sequence as a kind of string. For example, for
any variable sequence y = e_{i_1} ··· e_{i_d} ∈ V+, let |y| = d and y[c] = e_{i_c} for 1 ≤ c ≤ d.
Let val : V → Σ+ be the function which returns the string derived by an input
variable. If s = val(e) for e ∈ V, then we say that the variable e represents string
s. For any variable sequence y ∈ V+, let val+(y) = val(y[1]) ··· val(y[|y|]).

For two variables e_1, e_2 ∈ V, we say that e_1 occurs at position c in e_2 if there
is a node labeled with e_1 in the derivation tree of e_2 and the leftmost leaf of
the subtree rooted at that node labeled with e_1 is the c-th leaf in the derivation
tree of e_2. Furthermore, for variable sequence y ∈ V+, we say that y occurs at
position c in e if y[i] occurs at position c + |val+(y[..i − 1])| in e for 1 ≤ i ≤ |y|. We
define the function vOcc(e_1, e_2) which returns all positions of e_1 in the derivation
tree of e_2.

Straight-line programs. A straight-line program (SLP) is an ACFG in the
Chomsky normal form. Formally, an SLP 𝒮 of size n is an ACFG G = (Σ, V, D, X_n),
where V = {X_1, ..., X_n}, val(X_n) = T, D = {X_i → expr_i}_{i=1}^{n} with each expr_i
being either of form X_ℓ X_r (1 ≤ ℓ, r < i), or a single character a ∈ Σ. The size of
the SLP 𝒮 is the number n of productions in D. In the extreme cases the length
N of the string T can be as large as 2^(n−1); however, it is always the case that
n ≥ log_2 N. For any variable X_i with X_i → X_ℓ X_r ∈ D, let X_i.left = val(X_ℓ)
and X_i.right = val(X_r), which are called the left string and the right string of
X_i, respectively.

Example 2 (SLP). Let 𝒮 = (Σ, V, D, S) be the SLP s.t. Σ = {A, B, C}, V =
{X_1, ..., X_11}, D = {X_1 → A, X_2 → B, X_3 → C, X_4 → X_3 X_1, X_5 → X_4 X_2,
X_6 → X_5 X_5, X_7 → X_2 X_3, X_8 → X_1 X_2, X_9 → X_7 X_8, X_10 → X_6 X_9,
X_11 → X_10 X_6}, S = X_11. The derivation tree of S represents CABCABBCABCABCAB.
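To see how an SLP derives its string, the productions of Example 2 can be expanded recursively. The sketch below encodes the productions as read from the example (X_10 → X_6 X_9 and X_11 → X_10 X_6 are the readings consistent with the derived string); the dictionary layout and function names are assumptions made for illustration.

```python
from functools import lru_cache

# Productions of Example 2: X_i -> single character, or X_i -> X_l X_r.
SLP_RULES = {
    1: "A", 2: "B", 3: "C",
    4: (3, 1), 5: (4, 2), 6: (5, 5), 7: (2, 3), 8: (1, 2),
    9: (7, 8), 10: (6, 9), 11: (10, 6),
}


@lru_cache(maxsize=None)
def slp_val(i: int) -> str:
    """val(X_i): the string derived by variable X_i."""
    rhs = SLP_RULES[i]
    if isinstance(rhs, str):
        return rhs
    left, right = rhs
    return slp_val(left) + slp_val(right)


def left_string(i: int) -> str:
    """X_i.left = val(X_l) for a production X_i -> X_l X_r."""
    return slp_val(SLP_RULES[i][0])


def right_string(i: int) -> str:
    """X_i.right = val(X_r) for a production X_i -> X_l X_r."""
    return slp_val(SLP_RULES[i][1])
```

Full expansion takes time proportional to the length N of the derived string, which may be exponential in n; compressed algorithms avoid exactly this step.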

Run-length ACFGs. We define run-length ACFGs as an extension to ACFGs,
which allow run-length encodings in the righthand sides of productions. Formally,
a run-length ACFG is G = (Σ, V, D, S), where D = {e_i → xexpr_i}_{i=1}^{w}, val(S) =
T and each xexpr_i is in one of the following forms:

xexpr_i = a ∈ Σ,
xexpr_i = e_ℓ e_r ∈ V+ (e_ℓ, e_r < e_i), or
xexpr_i = ê^d ∈ V × ℕ (ê < e_i and d ≥ 1).

Hence xexpr_i ∈ Σ ∪ V+ ∪ (V × ℕ). The size of the run-length ACFG G is the
number w of productions in D.

Let Sig_G : Σ ∪ V+ ∪ (V × ℕ) → V be the function such that

Sig_G(x) = e if (e → x) ∈ D,
Sig_G(x) = Sig_G(Sig_G(x[1..|x| − 1]) x[|x|]) if x[i] ∈ V for 1 ≤ i ≤ |x|, 2 < |x| ≤ 4,
Sig_G(x) = undefined otherwise.

Namely, the function Sig_G returns, if any, the lefthand side of the corresponding
production for a given element in Σ ∪ V+ ∪ (V × ℕ), where sequences of length 3 or 4
are handled by recursively applying the Sig function from left to right. Let Assgn_G
be the function such that Assgn_G(e_i) = xexpr_i iff e_i → xexpr_i ∈ D. When clear
from the context, we write Sig_G and Assgn_G as Sig and Assgn, respectively. For
any p ∈ (Σ ∪ V+ ∪ (V × ℕ))*, let Sig+(p) = Sig(p[1]) ··· Sig(p[|p|]). We define the
left and right strings for any variable e_i → e_ℓ e_r ∈ D in a similar way to SLPs.
Furthermore, for any e_i → ê^d ∈ D, let e_i.left = val(ê) and e_i.right = val(ê)^(d−1).

In this paper, we consider a DAG of size w that is a compact representation
of the derivation trees of variables in a run-length ACFG G, where each node
represents a variable in V and out-going edges represent the assignments in D.
For example, if there exists an assignment e_i → e_ℓ e_r ∈ D, then there exist two
out-going edges from e_i to its ordered children e_ℓ and e_r. In addition, e_ℓ and e_r
have reversed edges to their parent e_i. For any e ∈ V, let parents(e) be the set
of variables which have an out-going edge to e in the DAG of G. If a node is labeled
by e, then the node is associated with |val(e)|.

Example 3 (Run-length ACFG). Let G = (Σ, V, D, S) be a run-length ACFG,
where Σ = {A, B, C}, V = {1, ..., 17}, D = {1 → A, 2 → B, 3 → C, 4 → 3^4,
5 → 1^1, 6 → 2^1, 7 → 3^1, 8 → (7, 5), 9 → (8, 6), 10 → (5, 6), 11 → (10, 4),
12 → 9^2, 13 → 10^7, 14 → 11^1, 15 → (12, 13), 16 → (15, 14), 17 → 16^1}, and S =
17. The derivation tree of the start symbol S represents a single string T =
CABCABABABABABABABABABCCCC. Here, 4.left = val(3), 4.right =
val+(3^3), Sig((7, 5)) = 8, Sig((7, 5, 6)) = 9, Sig((6, 5)) = undefined, parents(5) =
{8, 10} and vOcc(9, 17) = {1, 4}. See also Fig. 1 in Section 3 which illustrates
the derivation tree of the start symbol S and the DAG for G.
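The quantities listed in Example 3 can be checked with a small sketch. The encoding below (a frozen `Run` dataclass for productions of the form ê^d, a dictionary `RULES`, and a reverse dictionary `LHS_OF` standing in for the Sig lookup) is an assumption made for illustration, not the data structure of the paper; exponents follow the reading of the example in which 13 → 10^7.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Run:
    var: int  # the repeated variable
    exp: int  # the exponent d


# Productions: a character, a pair (e_l, e_r), or a run.
RULES = {
    1: "A", 2: "B", 3: "C",
    4: Run(3, 4), 5: Run(1, 1), 6: Run(2, 1), 7: Run(3, 1),
    8: (7, 5), 9: (8, 6), 10: (5, 6), 11: (10, 4),
    12: Run(9, 2), 13: Run(10, 7), 14: Run(11, 1),
    15: (12, 13), 16: (15, 14), 17: Run(16, 1),
}
LHS_OF = {rhs: e for e, rhs in RULES.items()}


def children(e: int) -> list:
    rhs = RULES[e]
    if isinstance(rhs, str):
        return []
    if isinstance(rhs, Run):
        return [rhs.var] * rhs.exp
    return list(rhs)


@lru_cache(maxsize=None)
def val(e: int) -> str:
    rhs = RULES[e]
    return rhs if isinstance(rhs, str) else "".join(val(c) for c in children(e))


def sig(x):
    """Sig(x); returns None where the text says 'undefined'."""
    if x in LHS_OF:
        return LHS_OF[x]
    if isinstance(x, tuple) and 2 < len(x) <= 4:
        head = sig(x[:-1])  # apply Sig from left to right
        return None if head is None else sig((head, x[-1]))
    return None


def parents(e: int) -> set:
    return {p for p in RULES if e in children(p)}


def vocc(e1: int, e2: int, start: int = 1) -> set:
    """1-based positions at which e1 occurs in the derivation tree of e2."""
    if e1 == e2:
        return {start}
    found = set()
    for c in children(e2):
        found |= vocc(e1, c, start)
        start += len(val(c))
    return found
```

Here `parents` scans all productions and `vocc` walks the whole derivation tree, which is fine for a toy grammar; the paper instead maintains parent lists and lengths on the DAG nodes.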

Dynamization and data structure of run-length ACFG. In this paper,
we consider a compressed representation and compressed index of a dynamic
text based on run-length ACFGs. Hence, upon edits on the text, the run-length
ACFG representing the text needs to be modified as well. To this end, we consider
dynamic run-length ACFGs, which allow for insertion of new assignments to D,
and allow for deletion of assignments e → xexpr from D only if |parents(e)| = 0.
We remark that the grammar under modification may temporarily represent
more than one text; however, this will be readily fixed as soon as we insert a
new start symbol of the grammar representing the edited text.

Next, we consider an abstract data structure H(f_A, f'_A) to maintain a
dynamic run-length ACFG G = (Σ, V, D, S) of size w. H consists of two
components A and B. The first component A is an abstract data structure of O(f'_A) size
which is able to add/remove an assignment to/from D in O(f_A) time. This data
structure is also able to compute Sig(xexpr) in O(f_A) time. For example, using
a balanced binary search tree for D, we achieve deterministic f_A = O(log w)
time and f'_A = O(w) space. Note that using the best known deterministic
predecessor/successor data structure for a dynamic set of integers [4], we achieve
deterministic f_A = O(min{(log log M log log w) / (log log log M), √(log w / log log w)})
time and f'_A = O(w) space, where M is the maximum length of the dynamic text.*
The second component B is the DAG of G introduced in the previous subsection.
The corresponding nodes and edges of the DAG can be added/deleted in constant
time per addition/deletion of an assignment. By maintaining parents(e) with a
doubly-linked list for each node v_e representing a variable e ∈ V, we obtain the
following lemma:

* Alstrup et al. [2] used hashing to maintain A and obtained a randomized H(1, w)
signature dictionary. However, since we are interested in the worst case time
complexities, we use balanced binary search trees or the data structure [4] in place of
hashing.

Lemma 2. Using H(f_A, f'_A) for a dynamic run-length ACFG G = (Σ, V, D, S)
of current size w, Sig(xexpr) can be computed in O(f_A) time, for a given xexpr ∈
Σ ∪ V+ ∪ (V × ℕ). Given a node v_e representing a variable e, Assgn(e) and |val(e)|
can be computed in O(1) time, and parents(e) can be computed in O(|parents(e)|)
time. We can also update H in O(f_A) time when an assignment is added to/removed
from D.

Note that Assgn(e), Sig(xexpr) and parents(e) can return not only the signatures
but also the corresponding nodes in the DAG.

3 Signature encoding

In this section, we recall the signature encoding first proposed by Mehlhorn
et al. [22]. The signature encoding of a string T is a run-length ACFG G =
(Σ, V, D, S) where the assignments in D are determined by recursively applying
to T the locally consistent parsing, the Eblock function, and the Sig function
(recall Section 2), until a single integer S is obtained. More formally, we use the
Shrink and Pow functions in the signature encoding of string T defined below:

Shrink_t^T = Sig+(T)                      for t = 0,
Shrink_t^T = Sig+(Eblock(Pow_{t−1}^T))    for 0 < t ≤ h,
Pow_t^T    = Sig+(Epow(Shrink_t^T))       for 0 ≤ t ≤ h,

where h is the minimum integer satisfying |Pow_h^T| = 1. Then, the start symbol
of the signature encoding is S = Pow_h^T, and the height of the derivation tree of
the signature encoding of T is O(h) = O(log N), where N = |T| (see also Fig. 1
below).

Example 4 (Signature encoding). Let G = (Σ, V, D, S) be the run-length ACFG
of Example 3. Assuming Eblock(Pow_0^T) = (7, 5, 6), (7, 5, 6), (5, 6)^7, (5, 6, 4) and
Eblock(Pow_1^T) = (12, 13, 14) hold, G is the signature encoding of T and id(T) =
17. See Fig. 1 for an illustration of the derivation tree of G and the corresponding
DAG.

Each variable of the signature encoding (the run-length ACFG defined this
way) is called a signature. For any string T ∈ Σ+, let id(T) = Pow_h^T = S, i.e.,
the integer S is the signature of T.
The signature encoding of a text T can be efficiently maintained under
insertion/deletion of arbitrary substrings to/from T. For this purpose, we use the
data structure H for the signature encoding of a dynamic text.
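The alternation of Pow (run-length grouping) and Shrink (block grouping) in the signature encoding can be sketched as follows. This is a static toy construction, not the dynamic structure of the paper: the locally consistent parsing is not computed here, so the block lengths at each level are passed in as an assumption (`block_lengths_per_level` is a hypothetical parameter), and signatures are numbered in order of first use, so the integers differ from those of Example 3 even though the grammar has the same shape.

```python
from itertools import groupby


def signature_encoding(text: str, block_lengths_per_level: list):
    """Return (id(T), rules) where rules maps each righthand side
    (a character, a pair, or ("^", e, d) for a run) to its signature."""
    rules = {}

    def sig(x):
        # Look up x, creating a fresh signature on first use.
        if x not in rules:
            rules[x] = len(rules) + 1
        return rules[x]

    def sig_block(block):
        # Sig applied from left to right: Sig(Sig(x[1..|x|-1]) x[|x|]).
        s = block[0]
        for nxt in block[1:]:
            s = sig((s, nxt))
        return s

    shrink = [sig(c) for c in text]  # Shrink_0 = Sig+(T)
    level = 0
    while True:
        # Pow_t = Sig+(Epow(Shrink_t))
        pow_seq = [sig(("^", e, len(list(g)))) for e, g in groupby(shrink)]
        if len(pow_seq) == 1:
            return pow_seq[0], rules
        lengths = block_lengths_per_level[level]  # assumed Eblock output
        assert sum(lengths) == len(pow_seq)
        blocks, pos = [], 0
        for length in lengths:
            blocks.append(tuple(pow_seq[pos:pos + length]))
            pos += length
        shrink = [sig_block(b) for b in blocks]  # Shrink_{t+1}
        level += 1


def expand(signature: int, rules: dict) -> str:
    """Decode a signature back into the string it represents."""
    inverse = {v: k for k, v in rules.items()}

    def val(e):
        x = inverse[e]
        if isinstance(x, str):
            return x
        if x[0] == "^":
            return val(x[1]) * x[2]
        return val(x[0]) + val(x[1])

    return val(signature)


T = "CABCABABABABABABABABABCCCC"
# Block lengths assumed in Example 4: (7,5,6),(7,5,6),(5,6)^7,(5,6,4), then (12,13,14).
root, rules = signature_encoding(T, [[3, 3] + [2] * 7 + [3], [3]])
```

With the block decompositions assumed in Example 4, the construction stops after two Shrink steps and creates 17 signatures, the same count as the grammar of Example 3.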


3.1 Properties of signature encodings 

Here we describe a number of useful properties of signature encodings. The 
ones with references to the literature are known but we provide their proofs for 
completeness. The other ones without references are our new discoveries. 


Substring extraction. By the definition of the Eblock function and Lemma 1,
for any 1 ≤ t ≤ h, |Shrink_t^T| ≤ |Pow_{t−1}^T|/2 and |Pow_t^T| ≤ |Pow_{t−1}^T|/2. Thus
h ≤ log |T| and the height of the derivation tree of e is O(log |val(e)|) for any
signature e ∈ V. Since each node of the DAG for a signature encoding stores the
length of the corresponding string, we have the following:

Fact 1 Using the DAG for a signature encoding G = (Σ, V, D, S) of size w, given
a signature e ∈ V (and its corresponding node in the DAG), and two positive
integers i, k, we can compute val(e)[i..i + k − 1] in O(log |val(e)| + k) time.
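The idea behind Fact 1 is to descend from the node of e towards the i-th leaf using the stored lengths, and then to collect k characters. The following sketch shows this on the grammar of Example 3; the encoding (`Run`, `RULES`, `LENGTH`) and the function names are assumptions for illustration, and the code makes no attempt to match the exact constant factors of the stated bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    var: int  # the repeated variable
    exp: int  # the exponent d


# Grammar of Example 3: a character, a pair (e_l, e_r), or a run.
RULES = {
    1: "A", 2: "B", 3: "C",
    4: Run(3, 4), 5: Run(1, 1), 6: Run(2, 1), 7: Run(3, 1),
    8: (7, 5), 9: (8, 6), 10: (5, 6), 11: (10, 4),
    12: Run(9, 2), 13: Run(10, 7), 14: Run(11, 1),
    15: (12, 13), 16: (15, 14), 17: Run(16, 1),
}

# |val(e)| stored per node; children always have smaller identifiers.
LENGTH = {}
for e in sorted(RULES):
    rhs = RULES[e]
    if isinstance(rhs, str):
        LENGTH[e] = 1
    elif isinstance(rhs, Run):
        LENGTH[e] = LENGTH[rhs.var] * rhs.exp
    else:
        LENGTH[e] = LENGTH[rhs[0]] + LENGTH[rhs[1]]


def _collect(e: int, start: int, k: int, out: list) -> None:
    """Append val(e)[start:start + k] (0-based) to out."""
    if k <= 0:
        return
    rhs = RULES[e]
    if isinstance(rhs, str):
        out.append(rhs)
    elif isinstance(rhs, Run):
        child_len = LENGTH[rhs.var]
        copy, offset = divmod(start, child_len)  # jump to the right copy
        while k > 0 and copy < rhs.exp:
            take = min(k, child_len - offset)
            _collect(rhs.var, offset, take, out)
            k, offset, copy = k - take, 0, copy + 1
    else:
        left, right = rhs
        left_len = LENGTH[left]
        if start < left_len:
            take = min(k, left_len - start)
            _collect(left, start, take, out)
            _collect(right, 0, k - take, out)
        else:
            _collect(right, start - left_len, k, out)


def extract(e: int, i: int, k: int) -> str:
    """val(e)[i..i + k - 1] with 1-based i."""
    out = []
    _collect(e, i - 1, k, out)
    return "".join(out)
```

For a run node the correct copy is located by one division, so the descent never iterates over skipped copies.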


Space requirement of the signature encoding. Recall that we handle
dynamic text of length at most M. Then, the maximum value of the signatures is
bounded by 3M − 1, since the derivation tree can contain at most M leaves, and
2M − 1 internal nodes (when there are no runs of same signatures at any height
of the derivation tree, the Pow function generates as many signatures as the Shrink
function). We also remark that the input of the Eblock function is a sequence
of signatures. Hence, Δ_L of Lemma 1 is bounded by log* 3M + 6 = O(log* M).
Note that we can bound M = O(|T|) if we do not update G after we compute
id(T).

Let N be the length of the current text T. The size w of the signature
encoding of T is bounded by 3N − 1 by the same reasoning as above. Also, the
following lemma shows that the signature encoding of T requires only compressed
space:

Lemma 3 ([26]). The size w of the signature encoding of T is O(z log N log* M),
where z is the number of factors in the LZ77 factorization without self-reference
of T.

Proof. See Appendix A.

Hence, we have w = O(min{z log N log* M, N}). In the sequel, we assume
z log N log* M ≤ N and will simply write w = O(z log N log* M), since otherwise
we can use some uncompressed dynamic text index in the literature.
Common sequences to all occurrences of same substrings. Here, we
recall the most important property of the signature encoding.

Let G = (Σ, V, D, S) be the signature encoding of text T. Let i, j (i < j) be
any positions in T, and let P = T[i..j]. Let P_i and P_j be the paths from the
root of the derivation tree of G to the i-th and j-th leaves, respectively. Then, at
each depth ℓ of the derivation tree of G, consider the sequence S_{i,j,ℓ} of signatures
which lie to the right of P_i with offset Δ_L + 3 and to the left of P_j with offset
Δ_R + 2. By the property of locally consistent parsing of Lemma 1, S_{i,j,ℓ} = S_{i',j',ℓ}
for any occurrence [i'..j'] of P in T and for any depth ℓ. We call each signature
contained in S_{i,j,ℓ} a consistent signature w.r.t. P.

Formally, we define the consistent signatures of P in the derivation tree of G
by the XShrink and XPow functions below, where the prefix of length at least Δ_L
and the suffix of length at least Δ_R + 1 are “ignored” at each depth of recursion:

Definition 3. For a string P, let 



for t = 0, 
for 0 <t < , 


XPowf = Sig+{Epow{XShrinkt[\L^\ + l..\XShrinkf\ - |Pf])|) for 0 < t < , 


where 


T 1C fh p ch nr-fp cf Tirpfi'r rtf IT nf Ipti nfh nf Jp/i cf /\ i- ci irh fh nf rlW T ,^ \ -L 



Note that Δ_L ≤ |L_t^P| ≤ Δ_L + 3 and Δ_R + 1 ≤ |R_t^P| ≤ Δ_R + 4 hold by the
definition. Hence |XShrink_{t+1}^P| > 0 holds if |Epow(XShrink_t^P)| > Δ_L + Δ_R +
9. See Fig. 2 for illustrations of consistent signatures of each occurrence of P
in T, which are represented by the gray boxes. Since at each depth we have
“ignored” the left and right contexts of respective length at most Δ_L + 3 and
Δ_R + 4, the consistent signatures at each depth are determined only by the
consistent signatures at the previous depth (1 level deeper). This implies that
for any occurrences of P in T, there are common consistent signatures (gray
boxes), which will simply be called the common signatures of P. The next lemma
formalizes this argument.

Lemma 4 (common sequences [26]). Let G = (Σ, V, D, S) be the signature
encoding of text T and let P be any string. Then there exists a common sequence
v = e_1, ..., e_d of signatures w.r.t. G which satisfies the following three conditions:
(1) val+(v) = P, (2) |Epow(v)| = O(log |P| log* M), and (3) for any e ∈ V and
integer i such that val(e)[i..i + |P| − 1] = P, v occurs at position i in e.

Proof. We consider the following short sequence Uniq(P) of signatures which
represents P (see also Fig. [5]):

Uniq{P) = ■■■Llp_^Llp_^XShrink^pR^p_^R^p_^---R^R^, 

where h^P is the minimum integer such that |Epow(XShrink_{h^P}^P)| ≤ Δ_L + Δ_R + 9.

We show Lemma 4 using Uniq(P), namely, we show v = Uniq(P) satisfies all
conditions (1)-(3). (1) This follows from Definition 3 (see also Fig. 2(2)). (2)
\EpowiXShrmk^p)\ < Al + Ar + 9, < log|P|, |Lf| = 0(Z\i), |i?f| = OiAn) 

and \Epow{L^)\ = \Epow{R^)\ = 1 for 1 < t < . Hence \Epow{Uniq{P))\ = 

0(log |P| log* M). (3) For simplicity, here we oirly consider the case where val{e) = 
T, since other cases can be shown similarly. Consider any integer i with + 
|P| — 1] = P (see also Fig. [2](2)). Note that for 0 < t < if XShrink^ occurs 
in Shrink^ , then XPow^ always occurs in Pow^ , because XPow^ is determined 
only by XShrink^ . Similarly, for 0 < f < if XPowf_i occurs in PowJ_i, 
then XShrink^ always occurs in ShrinkJ. Since XShrink^ occurs at position i 
in Shrinkg , XShrinkf and XPow^ occur in the derivation tree of id{T). Hence 
we discuss the positions of XShrinkf and XPow^. Now, let Ct + 1 and Ct + 
1 be the beginning positions of the corresponding occurrence of XShrinkf in 
Shrink:j. and that of XPowf in Pow^ , respectively. Then Shrinkj [..c*] consists of 
PowJ_i[..Ct-i] and for 0 < t < h^. Also, Powfl-Ct] consists of ShrinkJ [..ct] 
and Pf for 0 < t < h^. This means that Uniq{P) occurs at position i in id{T). 

Therefore Lemma 2] holds. □ 

The sequence v of signatures in Lemma 4 is called a common sequence of P =
$val(e)[i..i + k - 1]$ w.r.t. $\mathcal{G}$. Lemma 4 implies that any substring P of T can be
represented by a sequence p of signatures with $|Epow(p)| = O(\log |P| \log^* M)$.
The common sequences are conceptually equivalent to the cores [20] which are
defined for the edit sensitive parsing of a text, a kind of locally consistent parsing
of the text.

The number of ancestors of nodes corresponding to Uniq{P) is upper bounded 
by the next lemma. 

Lemma 5. Let T and P be strings, and let $\mathcal{T}$ be the derivation tree of the
signature encoding of T. Consider an occurrence of P in T, and the induced
subtree X of $\mathcal{T}$ whose root is the root of $\mathcal{T}$ and whose leaves are the parents of
the nodes representing Uniq(P). Then X contains $O(\log^* M)$ nodes for every
height and $O(\log |T| + \log |P| \log^* M)$ nodes in total.

Proof. By Definition 3, for every height, X contains $O(\log^* M)$ nodes that are
parents of the nodes representing Uniq(P). Lemma 5 holds because the number
of nodes at some height is halved when Shrink is applied. More precisely,
considering the x nodes of X at some height to which Shrink is applied, the number
of their parents is at most (x + 2)/2. □

The next lemma immediately follows from Lemma 5. It will be mainly
used in the proof of Lemma 21 in the Appendix and in the proof of Lemma 9.

Lemma 6. Let $s_1, s_2, s_3$ be any strings such that $s_3 = s_1 s_2$, and let $\mathcal{T}$ be
the derivation tree of $id(s_3)$. Consider the induced subtree X of $\mathcal{T}$ whose root
is the root of $\mathcal{T}$ and whose leaves are the parents of the nodes representing
$Uniq(s_1)Uniq(s_2)$ (see also Fig. 2(4)). Then the size of X is $O(\log |s_3| \log^* M)$.

The following lemma is about the computation of a common sequence of P. 

Lemma 7. Using the DAG for a signature encoding $\mathcal{G} = (\Sigma, \mathcal{V}, \mathcal{D}, S)$ of size w,
given a signature $e \in \mathcal{V}$ (and its corresponding node in the DAG) and two integers
i and k, we can compute $Epow(Uniq(s[i..i + k - 1]))$ in $O(\log |s| + \log k \log^* M)$
time, where s = val(e).

Proof. Let v be the common sequence of nodes which represents $Uniq(s[i..i + k - 1])$
and occurs at position i in e. Starting at the given node in the DAG which
corresponds to e, we compute the induced subtree which represents
$Uniq(s[i..i + k - 1])$, rooted at the lowest common ancestor of the nodes in v. By Lemma 5,
the size of this subtree is $O(\log |s| + \log k \log^* M)$. We can obtain the root of
this subtree in $O(\log |s|)$ time from the node representing e. Hence Lemma 7
holds. □

The next lemma shows that we can compute LCE efficiently using the signature
encoding of the (dynamic) text.

Lemma 8. Using the DAG for a signature encoding $\mathcal{G} = (\Sigma, \mathcal{V}, \mathcal{D}, S)$ of size
w, we can support queries $LCE(s_1, s_2, i, j)$ and $LCE(s_1^R, s_2^R, i, j)$ in
$O(\log |s_1| + \log |s_2| + \log \ell \log^* M)$ time for given two signatures $e_1, e_2 \in \mathcal{V}$ and
two integers $1 \le i \le |s_1|$, $1 \le j \le |s_2|$, where $s_1 = val(e_1)$, $s_2 = val(e_2)$ and $\ell$ is
the answer to the LCE query.

Proof. We focus on $LCE(s_1, s_2, i, j)$ as $LCE(s_1^R, s_2^R, i, j)$ is supported similarly.

Let P denote the longest common prefix of $s_1[i..]$ and $s_2[j..]$. Our algorithm
simultaneously traverses two derivation trees rooted at $e_1$ and $e_2$ and computes P
by matching the common signatures greedily from left to right. Since Uniq(P)
occurs at position i in $e_1$ and at position j in $e_2$ by Lemma 4, we can compute P
by finding the common sequence of nodes which represents Uniq(P), and
hence, we only have to traverse ancestors of such nodes. By Lemma 5, the number
of nodes we traverse, which dominates the time complexity, is upper bounded
by $O(\log |s_1| + \log |s_2| + |Epow(Uniq(P))|) = O(\log |s_1| + \log |s_2| + \log \ell \log^* M)$.

□
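
To make the query semantics of Lemma 8 concrete, the following is a naive character-by-character sketch. It is not the compressed algorithm of the lemma (it runs in time linear in the answer on plain strings); the function names `lce` and `lce_reversed` are hypothetical, positions are 1-based as in the text, and the reversed variant is assumed to be an LCE query on the reversed strings.

```python
def lce(s1: str, s2: str, i: int, j: int) -> int:
    """Length of the longest common prefix of s1[i..] and s2[j..] (1-based)."""
    a, b = i - 1, j - 1  # convert to 0-based offsets
    length = 0
    while (a + length < len(s1) and b + length < len(s2)
           and s1[a + length] == s2[b + length]):
        length += 1
    return length


def lce_reversed(s1: str, s2: str, i: int, j: int) -> int:
    """Assumed meaning of the reversed query: the same query on s1^R and s2^R."""
    return lce(s1[::-1], s2[::-1], i, j)
```

A compressed implementation would return the same values while touching only the ancestors of the common signatures, as argued in the proof.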


Construction. Recall that a signature encoding $\mathcal{G}$ generating a string T is
represented and maintained by a data structure $\mathcal{H}$. We show how to construct
an $\mathcal{H}(\log w, w)$ or $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ for $\mathcal{G}$. It can be constructed from various types of
inputs, such as (1) a plain (uncompressed) string T, (2) the LZ77 factorization of
T, and (3) an SLP which represents T, as summarized by the following theorem.

Theorem 2.
1. Given a string T of length N, we can construct $\mathcal{H}(\log w, w)$
for the signature encoding of size w which represents T in O(N) time and
working space, or $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ in $O(N f_{\mathcal{A}})$ time and $O(f'_{\mathcal{A}} + w)$ working space.

2. Given $f_1, \ldots, f_z$, the LZ77 factors without self reference of size z representing T
of length N, we can construct $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ for the signature encoding of size
w which represents T in $O(z f_{\mathcal{A}} \log N \log^* M)$ time and $O(f'_{\mathcal{A}} + w)$ working
space.

3. Given an SLP $\mathcal{S} = \{X_i \to expr_i\}_{i=1}^{n}$ of size n representing T of length N,
we can construct $\mathcal{H}(\log w, w)$ for the signature encoding of size w which
represents T in $O(n \log \log n \log N \log^* M)$ time and $O(n \log^* M + w)$ working
space, or $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ in $O(n f_{\mathcal{A}} \log N \log^* M)$ time and $O(f'_{\mathcal{A}} + w)$ working
space.

Proof. See Appendix A.

In the static case, the M term of Theorem 2 can be replaced with N.

Update. In Section 4, we describe our dynamic index using $\mathcal{H}$ for a signature
encoding $\mathcal{G}$ generating a string T. To this end, we consider the following update
operations for $\mathcal{G}$ using $\mathcal{H}$.

- INSERT(Y, i): Given a string Y and an integer i, update $\mathcal{H}$. The updated $\mathcal{H}$
handles a signature encoding $\mathcal{G}$ which generates $T' = T[..i - 1]\,Y\,T[i..]$.

- DELETE(i, k): Given two integers i, k, update $\mathcal{H}$. The updated $\mathcal{H}$ handles a
signature encoding $\mathcal{G}$ which generates $T' = T[..i - 1]\,T[i + k..]$.

During updates, a new assignment $e \to xexpr$ is appended to $\mathcal{G}$ whenever it is
needed, where $e = \max \mathcal{V} + 1$ is an integer that has not been used as a signature.
Specifically, we assign a new signature to xexpr when Sig(xexpr) returns undefined
for some form xexpr during updates. Also, updates may produce a redundant
signature whose parents in the DAG are all removed. To keep $\mathcal{G}$ admissible, we
remove such redundant signatures from $\mathcal{G}$ during updates.
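
As a reference point for the two update operations, the sketch below applies them to a plain (uncompressed) Python string. It only illustrates the resulting text T' with 1-based positions; it does not maintain a signature encoding, and the function names `insert_text` and `delete_text` are hypothetical.

```python
def insert_text(t: str, y: str, i: int) -> str:
    """Text after INSERT(Y, i): T' = T[..i-1] Y T[i..] (1-based i)."""
    return t[:i - 1] + y + t[i - 1:]


def delete_text(t: str, i: int, k: int) -> str:
    """Text after DELETE(i, k): T' = T[..i-1] T[i+k..] (1-based i)."""
    return t[:i - 1] + t[i - 1 + k:]
```

Deleting the k = |Y| characters starting at position i immediately after INSERT(Y, i) restores the original text, which is a convenient sanity check for any compressed implementation.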

Lemma 9. Using $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ for a signature encoding $\mathcal{G} = (\Sigma, \mathcal{V}, \mathcal{D}, S)$ of size w
which generates T, we can support INSERT(Y, i) and DELETE(i, k) in
$O(f_{\mathcal{A}}(k + \log N \log^* M))$ time, where |Y| = k.

Proof. We support DELETE(i, k) as follows: (1) Compute a new start variable
$S' = id(T[..i - 1]\,T[i + k..])$ by recomputing the new signature encoding from
$Uniq(T[..i - 1])$ and $Uniq(T[i + k..])$. This can be done in $O(f_{\mathcal{A}} \log N \log^* M)$
time by Lemmas 7 and 6. (2) Remove all redundant signatures Z from $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$.
Note that if a signature is redundant, then all the signatures along the path from
S to it are also redundant. Hence, we can remove all redundant signatures
efficiently by depth-first search starting from S, which takes $O(f_{\mathcal{A}} |Z|)$ time, where
$|Z| = O(k + \log N \log^* M)$ by Lemma 6.

Similarly, we can compute the INSERT operation in $O(f_{\mathcal{A}}(|Y| + \log N \log^* M))$
time by creating S' using $Uniq(T[..i - 1])$, $Uniq(Y)$ and $Uniq(T[i..])$. Note
that we can naively compute id(s) for a given string $s \in \Sigma^+$ in $O(f_{\mathcal{A}} |s|)$ time.
Therefore Lemma 9 holds. □

4 Dynamic Compressed Index

In this section, we present our dynamic compressed index based on signature
encoding. As already mentioned in Section 1.1, our strategy for pattern matching
is different from that of Alstrup et al. [5]. It is rather similar to the one taken in
the static index for SLPs of Claude and Navarro [5]. Besides applying their idea
to run-length ACFGs, we show how to speed up pattern matching by utilizing
the properties of signature encodings.

The rest of this section is organized as follows: In Section 4.1, we briefly
review the idea for the SLP index of Claude and Navarro [5]. In Section 4.2, we
extend their idea to run-length ACFGs. In Section 4.3, we consider an index on
signature encodings and improve the running time of pattern matching by using
the properties of signature encodings. In Section 4.4, we show how to dynamize
our index.

4.1 Static Index for SLP 

We review how the index in [8] for an SLP $\mathcal{S}$ generating a string T computes
Occ(P, T) for a given string P. The key observation is that any occurrence of P
in T can be uniquely associated with the lowest node that covers the occurrence
of P in the derivation tree. As the derivation tree is binary, if |P| > 1, then the
node is labeled with some variable $X \in \mathcal{V}$ such that $P_1$ is a suffix of X.left and
$P_2$ is a prefix of X.right, where $P = P_1 P_2$ with $1 \le |P_1| < |P|$. Here we call the
pair $(X, |X.left| - |P_1| + 1)$ a primary occurrence of P. Then, we can compute
Occ(P, T) by first computing such a primary occurrence and enumerating the
occurrences of X in the derivation tree.

Formally, we define the primary occurrences of P as follows. 

Definition 4 (The set of primary occurrences of P). For a string P with
|P| > 1 and an integer $1 \le j < |P|$, we define $pOcc_{\mathcal{S}}(P, j)$ and $pOcc_{\mathcal{S}}(P)$ as
follows:

\[ pOcc_{\mathcal{S}}(P, j) = \{ (X, |X.left| - j + 1) \mid X \in \mathcal{V},\ P[..j] \text{ is a suffix of } X.left,\ P[j + 1..] \text{ is a prefix of } X.right \}, \]

\[ pOcc_{\mathcal{S}}(P) = \bigcup_{1 \le j < |P|} pOcc_{\mathcal{S}}(P, j). \]

We call each element of $pOcc_{\mathcal{S}}(P)$ a primary occurrence of P.
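
The definition can be checked by brute force on a small grammar. The sketch below uses a hypothetical toy SLP (the rule table `RULES` and all function names are illustrative assumptions, not taken from the paper), computes primary occurrences directly from the definition, and then maps each primary occurrence (X, j) through every occurrence k of X in the derivation tree to the text position j + k - 1.

```python
from functools import lru_cache

# Hypothetical toy SLP: a variable derives a single character or a pair of variables.
RULES = {
    "X1": "a", "X2": "b",
    "X3": ("X1", "X2"),  # ab
    "X4": ("X3", "X1"),  # aba
    "X5": ("X4", "X3"),  # abaab
    "X6": ("X5", "X4"),  # abaababa
}
START = "X6"


@lru_cache(maxsize=None)
def val(x):
    """String derived by variable x."""
    rule = RULES[x]
    return rule if isinstance(rule, str) else val(rule[0]) + val(rule[1])


def v_occ(x, root=START):
    """1-based start positions of the nodes labeled x in the derivation tree."""
    out = []

    def walk(node, pos):
        if node == x:
            out.append(pos)
        rule = RULES[node]
        if not isinstance(rule, str):
            walk(rule[0], pos)
            walk(rule[1], pos + len(val(rule[0])))

    walk(root, 1)
    return sorted(out)


def p_occ(p):
    """Primary occurrences of p, straight from the definition (brute force)."""
    res = set()
    for x, rule in RULES.items():
        if isinstance(rule, str):
            continue
        left, right = val(rule[0]), val(rule[1])
        for j in range(1, len(p)):
            if left.endswith(p[:j]) and right.startswith(p[j:]):
                res.add((x, len(left) - j + 1))
    return res


def occ(p):
    """All occurrences of p (|p| > 1) obtained from primary occurrences."""
    return sorted({j + k - 1 for (x, j) in p_occ(p) for k in v_occ(x)})


def naive_occ(p, t):
    """Reference answer by direct scanning."""
    return [i + 1 for i in range(len(t) - len(p) + 1) if t.startswith(p, i)]
```

For this toy grammar, `occ(p)` agrees with `naive_occ(p, val(START))` for every substring p of length at least two; the index described in this section replaces the brute-force loops by range reporting queries.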

The set Occ(P, T) of occurrences of P in T is represented by $pOcc_{\mathcal{S}}(P)$ as follows.

Observation 1. For any string P,

\[ Occ(P, T) = \begin{cases} \{ j + k - 1 \mid (X, j) \in pOcc_{\mathcal{S}}(P),\ k \in vOcc(X, S) \} & \text{if } |P| > 1, \\ \{ k \mid k \in vOcc(X, S),\ (X \to P) \in \mathcal{D} \} & \text{if } |P| = 1. \end{cases} \]

By Observation 1, the task is to compute $pOcc_{\mathcal{S}}(P)$ and vOcc(X, S) efficiently.
Note that vOcc(X, S) can be computed in $O(|vOcc(X, S)|\,h)$ time by
traversing the DAG in a reversed direction (i.e., using the parents(X) function
recursively) from X to the source, where h is the height of the derivation tree of
$\mathcal{S}$. Hence, in what follows, we focus on how to compute $pOcc_{\mathcal{S}}(P)$ for a string P
with |P| > 1. In order to compute $pOcc_{\mathcal{S}}(P, j)$, we use a data structure to solve
the following problem:

Problem 1 (Two-Dimensional Orthogonal Range Reporting Problem). Let $\mathcal{X}$ and
$\mathcal{Y}$ denote subsets of two ordered sets, and let $\mathcal{R} \subseteq \mathcal{X} \times \mathcal{Y}$ be a set of points on
the two-dimensional plane, where $|\mathcal{X}|, |\mathcal{Y}| \in O(|\mathcal{R}|)$. A data structure for this
problem supports a query $report_{\mathcal{R}}(x_1, x_2, y_1, y_2)$: given a rectangle $(x_1, x_2, y_1, y_2)$
with $x_1, x_2 \in \mathcal{X}$ and $y_1, y_2 \in \mathcal{Y}$, it returns $\{(x, y) \in \mathcal{R} \mid x_1 \le x \le x_2,\ y_1 \le y \le y_2\}$.

Data structures for Problem 1 are widely studied in computational geometry.
There is even a dynamic variant, which we finally use for our dynamic index in
Section 4.4. Until then, we just use any static data structure that occupies $O(|\mathcal{R}|)$
space and supports queries in $O(\hat{q}_{|\mathcal{R}|} + q_{|\mathcal{R}|}\,qocc)$ time with $\hat{q}_{|\mathcal{R}|} = O(\log |\mathcal{R}|)$,
where qocc is the number of points to report.
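
The interface of Problem 1 can be sketched with a deliberately simple structure: points sorted by x, a binary search for the x-range, and a linear filter on y. This does not meet the query bound stated above (the filter is linear in the number of points inside the x-range); the class name `RangeReporter` is a hypothetical choice, and the coordinates may be any mutually comparable values, such as strings under lexicographic order.

```python
from bisect import bisect_left, bisect_right


class RangeReporter:
    """Naive static structure for two-dimensional orthogonal range reporting."""

    def __init__(self, points):
        self.points = sorted(points)             # sorted by x, then y
        self.xs = [p[0] for p in self.points]    # x-coordinates for binary search

    def report(self, x1, x2, y1, y2):
        """Return all (x, y) with x1 <= x <= x2 and y1 <= y <= y2."""
        lo = bisect_left(self.xs, x1)
        hi = bisect_right(self.xs, x2)
        return [(x, y) for x, y in self.points[lo:hi] if y1 <= y <= y2]
```

For example, `RangeReporter([(1, 5), (2, 1), (3, 3), (4, 4), (5, 2)]).report(2, 4, 2, 4)` returns the two points (3, 3) and (4, 4).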

Now, given an SLP $\mathcal{S}$, we consider a two-dimensional plane defined by $\mathcal{X} =
\{X.left^R \mid X \in \mathcal{V}\}$ and $\mathcal{Y} = \{X.right \mid X \in \mathcal{V}\}$, where elements in $\mathcal{X}$ and $\mathcal{Y}$ are
sorted by lexicographic order. Then consider a set of points $\mathcal{R} = \{(X.left^R, X.right) \mid
X \in \mathcal{V}\}$. For a string P and an integer $1 \le j < |P|$, let $y_1^{(P,j)}$ (resp. $y_2^{(P,j)}$) denote
the lexicographically smallest (resp. largest) element in $\mathcal{Y}$ that has $P[j + 1..]$ as
a prefix. If there is no such element, it just returns NIL and we can immediately
know that $pOcc_{\mathcal{S}}(P, j) = \emptyset$. We also define $x_1^{(P,j)}$ and $x_2^{(P,j)}$ in a similar way
over $\mathcal{X}$, i.e., $x_1^{(P,j)}$ (resp. $x_2^{(P,j)}$) is the lexicographically smallest (resp. largest)
element in $\mathcal{X}$ that has $P[..j]^R$ as a prefix. Then, $pOcc_{\mathcal{S}}(P, j)$ can be computed
by a query $report_{\mathcal{R}}(x_1^{(P,j)}, x_2^{(P,j)}, y_1^{(P,j)}, y_2^{(P,j)})$. See also Example 5.

Example 5 (SLP). Let S be the SLP of Example [21 Then, 


X , X4 , X2 , x ^, X5, X9, Xq , xiQ, Xn ^ X3, xr }, 

$\mathcal{Y} = \{y_1, y_8, y_2, y_7, y_9, y_3, y_4, y_5, y_6, y_{10}, y_{11}\}$,


where $x_i = X_i.left^R$ and $y_i = X_i.right$ for any $X_i \in \mathcal{V}$. Given a pattern P = SCAB,
then $pOcc_{\mathcal{S}}(P, 1) = \{(X_8, 3), (X_{11}, 10)\}$, $pOcc_{\mathcal{S}}(P, 2) = \{(X_9, 1)\}$, $pOcc_{\mathcal{S}}(P, 3) =
\emptyset$, $vOcc(X_8, S) = \{1, 11\}$, $vOcc(X_9, S) = \{7\}$ and $vOcc(X_{11}, S) = \{1\}$.
Hence $pOcc_{\mathcal{S}}(P) = \{(X_8, 3), (X_{11}, 10), (X_9, 1)\}$ and $Occ(P, T) = \{3, 7, 10, 13\}$.
See also Fig. 3.

We can get the following result:

Lemma 10. For an SLP $\mathcal{S}$ of size n, there exists a data structure of size O(n)
that computes, given a string P, $pOcc_{\mathcal{S}}(P)$ in $O(|P|(h + |P|) \log n + q_n |pOcc_{\mathcal{S}}(P)|)$
time.

Proof. For every $1 \le j < |P|$, we compute $pOcc_{\mathcal{S}}(P, j)$ by
$report_{\mathcal{R}}(x_1^{(P,j)}, x_2^{(P,j)}, y_1^{(P,j)}, y_2^{(P,j)})$.
We can compute $y_1^{(P,j)}$ and $y_2^{(P,j)}$ in $O((h + |P|) \log n)$ time by binary search on $\mathcal{Y}$,
where each comparison takes $O(h + |P|)$ time for expanding the first $O(|P|)$
characters of variables subjected to comparison. In a similar way, $x_1^{(P,j)}$ and $x_2^{(P,j)}$
can be computed in $O((h + |P|) \log n)$ time. Thus, the total time complexity is
$O(|P|((h + |P|) \log n + \hat{q}_n) + q_n |pOcc_{\mathcal{S}}(P)|) = O(|P|(h + |P|) \log n + q_n |pOcc_{\mathcal{S}}(P)|)$.

□

4.2 Static Index for Run-length ACFG 

In this subsection, we extend the idea for the SLP index described in Section 4.1
to run-length ACFGs. Consider occurrences of a string P with |P| > 1 in a
run-length ACFG $\mathcal{G} = (\Sigma, \mathcal{V}, \mathcal{D}, S)$ generating a string T. The difference from SLPs is
that we have to deal with occurrences of P that are covered by a node labeled
with $e \to \hat{e}^d$ but not covered by any single child of the node in the derivation
tree. In such a case, there must exist $P = P_1 P_2$ with $1 \le |P_1| < |P|$ such that
$P_1$ is a suffix of $e.left = val^+(\hat{e})$ and $P_2$ is a prefix of $e.right = val^+(\hat{e}^{d-1})$. Let
$j = |val(\hat{e})| - |P_1| + 1$ be a position in $val^+(\hat{e}^d)$ where P occurs, then P also occurs
at $j + c|val(\hat{e})|$ in $val^+(\hat{e}^d)$ for every positive integer c with
$j + c|val(\hat{e})| + |P| - 1 \le |val^+(\hat{e}^d)|$. Remarking that we apply Definition 4 of primary
occurrences to run-length ACFGs as they are, we formalize our observation to
compute Occ(P, T) as follows:

Observation 2. For any string P with |P| > 1,
$Occ(P, T) = \{ j + k + c - 1 \mid (e, j) \in pOcc_{\mathcal{G}}(P),\ c \in \{0\} \cup Run_{\mathcal{G}}(e, j, |P|),\ k \in vOcc(e, S) \}$, where

\[ Run_{\mathcal{G}}(e, j, |P|) = \begin{cases} \{ c\,|val(\hat{e})| \mid 1 \le c,\ j + c\,|val(\hat{e})| + |P| - 1 \le |val^+(\hat{e}^d)| \} & \text{if } e \to \hat{e}^d, \\ \emptyset & \text{otherwise.} \end{cases} \]

By the above observation, we can get the same result for a run-length ACFG as
for an SLP in Lemma 10.
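
The extra shifts contributed by a run node can be enumerated directly from the condition above. In the sketch below, a run is described only by the length of the repeated unit and the exponent d; the function names `run_offsets` and `naive_positions` are hypothetical, and the second function is a plain scan used purely as a cross-check.

```python
def run_offsets(unit_len: int, d: int, j: int, p_len: int) -> list:
    """Shifts c * unit_len (c >= 1) with j + c * unit_len + p_len - 1 <= unit_len * d."""
    total = unit_len * d  # length of the string derived by the run node
    offsets = []
    c = 1
    while j + c * unit_len + p_len - 1 <= total:
        offsets.append(c * unit_len)
        c += 1
    return offsets


def naive_positions(p: str, t: str) -> list:
    """1-based positions of p in t by direct scanning (cross-check only)."""
    return [i + 1 for i in range(len(t) - len(p) + 1) if t.startswith(p, i)]
```

With the unit ab repeated four times and P = ba, the primary position is j = 2 and the shifts are 2 and 4, so P occurs at positions 2, 4 and 6 of abababab.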

4.3 Static Index for Signature Encoding 

We can apply the result of Section 4.2 to signature encodings because signature
encodings are run-length ACFGs, i.e., we can compute Occ(P, T) by querying
$report_{\mathcal{R}}(x_1^{(P,j)}, x_2^{(P,j)}, y_1^{(P,j)}, y_2^{(P,j)})$ for "every" $1 \le j < |P|$. However, the
properties of signature encodings allow us to speed up pattern matching as summarized
in the following two ideas: (1) We can efficiently compute $x_1^{(P,j)}$, $x_2^{(P,j)}$, $y_1^{(P,j)}$ and
$y_2^{(P,j)}$ using LCE queries in compressed space (Lemma 11). (2) We can reduce
the number of $report_{\mathcal{R}}$ queries from O(|P|) to $O(\log |P| \log^* M)$ by using the
property of the common sequence of P (Lemma 12).

Lemma 11. Assume that we have the DAG for a signature encoding $\mathcal{G} = (\Sigma, \mathcal{V}, \mathcal{D}, S)$
of size w and $\mathcal{X}$ and $\mathcal{Y}$ of $\mathcal{G}$. Given a signature $id(P) \in \mathcal{V}$ for a string P and
an integer j, we can compute $x_1^{(P,j)}$, $x_2^{(P,j)}$, $y_1^{(P,j)}$ and $y_2^{(P,j)}$ in
$O(\log w(\log N + \log |P| \log^* M))$ time.

Proof. By Lemma 8 and Fact 1, we can compute $x_1^{(P,j)}$ and $x_2^{(P,j)}$ on $\mathcal{X}$ by binary
search in $O(\log w(\log N + \log |P| \log^* M))$ time. Similarly, we can compute $y_1^{(P,j)}$
and $y_2^{(P,j)}$ in the same time. □

Lemma 12. Let P be a string with |P| > 1. If $|Pow_0^P| = 1$, then $pOcc_{\mathcal{G}}(P) =
pOcc_{\mathcal{G}}(P, 1)$. If $|Pow_0^P| > 1$, then $pOcc_{\mathcal{G}}(P) = \bigcup_{j \in \mathcal{P}} pOcc_{\mathcal{G}}(P, j)$, where u is the
common sequence of P and $\mathcal{P} = \{ |val^+(u[1..i])| \mid 1 \le i < |u|,\ u[i] \ne u[i + 1] \}$.

Proof. If $|Pow_0^P| = 1$, then $P = a^{|P|}$ for some character $a \in \Sigma$. In this case, P
must be contained in a node labeled with a signature $e \to \hat{e}^d$ such that $\hat{e} \to a$
and $d \ge |P|$. Hence, all primary occurrences of P can be found by $pOcc_{\mathcal{G}}(P, 1)$.

If $|Pow_0^P| > 1$, we consider the common sequence u of P. Recall that u
occurs at position j in e for any $(e, j) \in pOcc_{\mathcal{G}}(P)$ by Lemma 4. Hence at least
$pOcc_{\mathcal{G}}(P) = \bigcup_{i \in \mathcal{P}'} pOcc_{\mathcal{G}}(P, i)$ holds, where
$\mathcal{P}' = \{ |val^+(u[1])|, \ldots, |val^+(u[..|u| - 1])| \}$.
Moreover, we show that $pOcc_{\mathcal{G}}(P, i) = \emptyset$ for any $i \in \mathcal{P}'$ with $u[i] = u[i + 1]$.
Note that u[i] and u[i + 1] are encoded into the same signature in the derivation
tree of e, and that the parent of two nodes corresponding to u[i] and u[i + 1] has
a signature e' in the form $e' \to u[i]^d$. Now assume for the sake of contradiction
that e = e'. By the definition of the primary occurrences, i = 1 must hold, and
hence, $Shrink_0^P[1] = u[1] \in \Sigma$. This means that $P = u[1]^{|P|}$, which contradicts
$|Pow_0^P| > 1$. Therefore the statement holds. □

Using Lemmas 4, 7, 11 and 12, we get the following lemma:

Lemma 13. For a signature encoding $\mathcal{G}$, represented by $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$, of size w
which generates a text T of length N, there exists a data structure of size O(w)
that computes, given a string P, $pOcc_{\mathcal{G}}(P)$ in
$O(|P| f_{\mathcal{A}} + \log w \log |P| \log^* M(\log N + \log |P| \log^* M) + q_w |pOcc_{\mathcal{G}}(P)|)$ time.

Proof. We focus on the case $|Pow_0^P| > 1$ as the other case is easier to solve.
We first compute the common sequence of P in $O(|P| f_{\mathcal{A}})$ time. Taking $\mathcal{P}$ in
Lemma 12, we recall that $|\mathcal{P}| = O(\log |P| \log^* M)$ by Lemma 4. Then, in light
of Lemma 12, $pOcc_{\mathcal{G}}(P)$ can be obtained by $|\mathcal{P}| = O(\log |P| \log^* M)$ range
reporting queries. For each query, we spend $O(\log w(\log N + \log |P| \log^* M))$
time to compute $x_1^{(P,j)}$, $x_2^{(P,j)}$, $y_1^{(P,j)}$ and $y_2^{(P,j)}$ by Lemma 11. Hence, the
total time complexity is
$O(|P| f_{\mathcal{A}} + \log |P| \log^* M(\log w(\log N + \log |P| \log^* M) + \hat{q}_w) + q_w |pOcc_{\mathcal{G}}(P)|)
= O(|P| f_{\mathcal{A}} + \log w \log |P| \log^* M(\log N + \log |P| \log^* M) + q_w |pOcc_{\mathcal{G}}(P)|)$. □

4.4 Dynamic Index for Signature Encoding 

In order to dynamize our static index in the previous subsection, we consider a
data structure for "dynamic" two-dimensional orthogonal range reporting that
can support the following update operations:

- $insert_{\mathcal{R}}(p, x_{pred}, y_{pred})$: given a point p = (x, y), $x_{pred} = \max\{x' \in \mathcal{X} \mid x' \le x\}$
and $y_{pred} = \max\{y' \in \mathcal{Y} \mid y' \le y\}$, insert p to $\mathcal{R}$ and update $\mathcal{X}$ and $\mathcal{Y}$
accordingly.

- $delete_{\mathcal{R}}(p)$: given a point $p = (x, y) \in \mathcal{R}$, delete p from $\mathcal{R}$ and update $\mathcal{X}$ and
$\mathcal{Y}$ accordingly.

We use the following data structure:

Lemma 14 ([7]). There exists a data structure that supports $report_{\mathcal{R}}(x_1, x_2, y_1, y_2)$
in $O(\log |\mathcal{R}| + occ(\log |\mathcal{R}| / \log \log |\mathcal{R}|))$ time, and $insert_{\mathcal{R}}(p, x_{pred}, y_{pred})$, $delete_{\mathcal{R}}(p)$ in
amortized $O(\log |\mathcal{R}|)$ time, where occ is the number of the elements to output.
This structure uses $O(|\mathcal{R}|)$ space.¹

Now we are ready to prove Theorem 1.

Proof (of Theorem 1). Our index consists of $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ and a dynamic
range reporting data structure $\Lambda$ of Lemma 14 whose $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{R}$ are maintained
as they are defined in the static version. We maintain $\mathcal{X}$ and $\mathcal{Y}$ in two ways:
self-balancing binary search trees for binary search, and Dietz and Sleator's data
structures for order maintenance. Then, primary occurrences of P can be computed as
described in Lemma 13. Adding the $O(occ \log N)$ term for computing all pattern
occurrences from primary occurrences, we get the time complexity for pattern
matching in the statement. Concerning the update of our index, we described
how to update $\mathcal{H}$ after INSERT and DELETE in Lemma 9. What remains is to
show how to update $\Lambda$ when a signature is inserted into or deleted from $\mathcal{V}$. When
a signature e is deleted from $\mathcal{V}$, we first locate $e.left^R$ on $\mathcal{X}$ and e.right on $\mathcal{Y}$,
and then execute $delete_{\mathcal{R}}(e.left^R, e.right)$. When a signature e is inserted into $\mathcal{V}$,
we first locate $x_{pred} = \max\{x' \in \mathcal{X} \mid x' \le e.left^R\}$ on $\mathcal{X}$ and
$y_{pred} = \max\{y' \in \mathcal{Y} \mid y' \le e.right\}$ on $\mathcal{Y}$, and then execute
$insert_{\mathcal{R}}((e.left^R, e.right), x_{pred}, y_{pred})$.
The locating can be done by binary search on $\mathcal{X}$ and $\mathcal{Y}$ in $O(\log w \log N \log^* M)$
time as in Lemma 11. In a single INSERT(Y, i) or DELETE(i, y) operation,
$O(y + \log N \log^* M)$ signatures are inserted into or deleted from $\mathcal{V}$, where |Y| = y.
Hence we get Theorem 1. □

5 Applications 

In this section, we present a number of applications of the data structures of
Sections 3 and 4. Theorems 3 and 4 are applications to text compression.

Theorem 3. Given a string T of length N, we can compute the LZ77 factorization
of T in $O(N f_{\mathcal{A}} + z \log w \log^3 N (\log^* N)^2)$ time and $O(w + f'_{\mathcal{A}})$ working space
using $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ for a signature encoding of size w which generates T, where z
is the size of the LZ77 factorization of T and $w = O(z \log N \log^* N)$.

¹ The original problem in [7] considers a real plane; however, the solution
only needs to compare any two elements in $\mathcal{R}$ in constant time. Hence the solution
can be applied to our range reporting problem by maintaining $\mathcal{X}$ and $\mathcal{Y}$ using the data
structure for the order maintenance problem proposed by Dietz and Sleator [5], which
enables us to compare any two elements in a list L and insert/delete an element
to/from L in constant time.

Theorem 4. (1) Given an $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ for a signature encoding $\mathcal{G} = (\Sigma, \mathcal{V}, \mathcal{D}, S)$
of size w which generates T, we can compute an SLP $\mathcal{S}$ of size $O(w \log |T|)$
generating T in $O(w \log |T|)$ time. (2) Let us conduct a single INSERT or DELETE
operation on the string T generated by the SLP of (1). Let y be the length of the
substring to be inserted or deleted, and let T' be the resulting string. During the
above operation on the string, we can update, in
$O((y + \log |T'| \log^* M)(f_{\mathcal{A}} + \log |T'|))$ time, the SLP of (1) to an SLP $\mathcal{S}'$ of size
$O(w' \log |T'|)$ which generates T', where M is the maximum length of the dynamic
text and w' is the size of the updated $\mathcal{G}$ which generates T'.

Theorems 5-9 are applications to compressed string processing (CSP), where
the task is to process a given compressed representation of string(s) without
explicit decompression.

Theorem 5. Given an SLP $\mathcal{S}$ of size n generating a string of length N, we can
construct, in $O(n \log \log n \log N \log^* N)$ time, a data structure which occupies
$O(n \log N \log^* N)$ space and supports $LCP(X_i, X_j)$ and $LCS(X_i, X_j)$ queries for
variables $X_i, X_j$ in $O(\log N)$ time. The $LCP(X_i, X_j)$ and $LCS(X_i, X_j)$ query
times can be improved to O(1) using $O(n \log n \log N \log^* N)$ preprocessing time.

Theorem 6. Given an SLP $\mathcal{S}$ of size n generating a string T of length N,
there is a data structure which occupies O(w + n) space and supports queries
$LCE(X_i, X_j, a, b)$ for variables $X_i, X_j$, $1 \le a \le |X_i|$ and $1 \le b \le |X_j|$ in
$O(\log N + \log \ell \log^* N)$ time, where $w = O(z \log N \log^* N)$. The data
structure can be constructed in $O(n \log \log n \log N \log^* N)$ preprocessing time and
$O(n \log^* N + w)$ working space, where $z \le n$ is the size of the LZ77 factorization
of T and $\ell$ is the answer of the LCE query.

Let h be the height of the derivation tree of a given SLP $\mathcal{S}$. Note that $h \ge
\log N$. Matsubara et al. showed an $O(nh(n + h \log N))$-time $O(n(n + \log N))$-space
algorithm to compute an $O(n \log N)$-size representation of all palindromes
in the string. Their algorithm uses a data structure which supports, in $O(h^2)$
time, LCE queries of a special form $LCE(X_i, X_j, 1, p_j)$ [23]. This data structure
takes $O(n^2)$ space and can be constructed in $O(n^2 h)$ time [19]. Using Theorem 6,
we obtain a faster algorithm, as follows:

Theorem 7. Given an SLP of size n generating a string of length N, we can
compute an $O(n \log N)$-size representation of all palindromes in the string in
$O(n \log^2 N \log^* N)$ time and $O(n \log^* N + w)$ space.

A non-empty string s is called a Lyndon word if s is the lexicographically
smallest suffix of s. The Lyndon factorization of a non-empty string w is a
sequence of pairs $(|f_i|, p_i)$ where each $f_i$ is a Lyndon word and $p_i$ is a positive
integer such that $w = f_1^{p_1} \cdots f_m^{p_m}$ and $f_i$ is lexicographically smaller than $f_{i-1}$
for all $1 < i \le m$. I et al. proposed a Lyndon factorization algorithm running
in $O(nh(n + \log n \log N))$ time and $O(n^2)$ space. Their algorithm uses the LCE
data structure on SLPs which requires $O(n^2 h)$ preprocessing time, $O(n^2)$
working space, and $O(h \log N)$ time for LCE queries. We can obtain a faster
algorithm using Theorem 6.

Theorem 8. Given an SLP of size n generating a string of length N, we can
compute the Lyndon factorization of the string in $O(n(n + \log n \log N \log^* N))$
time and $O(n^2 + z \log N \log^* N)$ space.
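
For intuition about the object being computed, the following sketch produces the Lyndon factorization of an uncompressed string with Duval's classical linear-time algorithm. This is a swapped-in textbook technique for plain strings, not the SLP-based algorithm of Theorem 8; the function name is hypothetical, and for readability it returns pairs (factor, power) rather than (|f_i|, p_i).

```python
def lyndon_factorization(w: str) -> list:
    """Duval's algorithm: returns [(f_1, p_1), ..., (f_m, p_m)] with f_1 > ... > f_m."""
    out = []
    i, n = 0, len(w)
    while i < n:
        j, k = i + 1, i
        # Extend while w[i..j] is still a prefix of a power of a Lyndon word.
        while j < n and w[k] <= w[j]:
            k = i if w[k] < w[j] else k + 1
            j += 1
        period = j - k          # length of the Lyndon word found
        start, count = i, 0
        while i <= k:           # emit every full repetition of that word
            i += period
            count += 1
        out.append((w[start:start + period], count))
    return out
```

For example, `lyndon_factorization("banana")` yields the factors b, an (with power 2) and a, whose concatenation with powers expanded is the input string.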

We can also solve the grammar compressed dictionary matching problem
with our data structures. We preprocess an input dictionary SLP (DSLP) $(\mathcal{S}, m)$
with n productions that represent m patterns. Given an uncompressed text T,
the task is to output all occurrences of the patterns in T.

Theorem 9. Given a DSLP $(\mathcal{S}, m)$ of size n that represents a dictionary $\Pi_{(\mathcal{S},m)}$
for m patterns of total length N, we can preprocess the DSLP in
$O((n \log \log n + m \log m) \log N \log^* N)$ time and $O(n \log N \log^* N)$ space so that, given any text
T in a streaming fashion, we can detect all occ occurrences of the patterns in T
in $O(|T| \log m \log N \log^* N + occ)$ time.

It was shown in m that we can construct in 0{n'^ log n) time a data structure 
of size $O(n^2 \log N)$ which finds all occurrences of the patterns in T in $O(|T|(h +
m))$ time, where h is the height of the derivation tree of DSLP $(\mathcal{S}, m)$. Note
that our data structure of Theorem 9 is always smaller, and runs faster when
$h = \omega(\log m \log N \log^* N)$.

Acknowledgments. We would like to thank Pawel Gawrychowski for drawing

Fig. 1. The derivation tree of S (left) and the DAG for $\mathcal{G}$ (right) of Example 3. In
the DAG, the black and red arrows represent $e \to e_\ell e_r$ and $e \to \hat{e}^k$, respectively. In
the example, T is encoded by signature encoding. In the derivation tree of S, the dotted
boxes represent the blocks created by the Eblock function.

Fig. 2. Abstract images of consistent signatures of substring P of text T, on the
derivation trees of the signature encoding of T. Gray rectangles in Figures (1)-(3)
represent consistent signature sequences for occurrences of P. (1) Each $XShrink_t^P$ and
$XPow_t^P$ occur on substring P in $Shrink_t^T$ and $Pow_t^T$, respectively, where T = LPR. (2)
The substring P can be represented by
$L_0^P \hat{L}_0^P L_1^P \hat{L}_1^P \, XShrink_2^P \, \hat{R}_1^P R_1^P \hat{R}_0^P R_0^P$. (3) There
exist common signatures on every substring P in the derivation tree. (4) The derivation
tree of $id(s_3)$ and the subtree X in the proof of Lemma 6.

A Appendix: Proof of Theorem 2

A.1 Proof of Theorem 2 (1)

$\mathcal{H}(\log w, w)$ construction in O(N) time and space. Our algorithm computes
signatures level by level, i.e., constructs incrementally $Shrink_0^T, Pow_0^T,
\ldots, Shrink_h^T, Pow_h^T$. For each level, we determine signatures by sorting signature
blocks (or run-length encoded signatures) to which we give signatures. The
following two lemmas describe the procedure.

Lemma 15. Given $Eblock(Pow_{t-1}^T)$ for $0 < t \le h$, we can compute $Shrink_t^T$ in
$O((b - a) + |Pow_{t-1}^T|)$ time and space, where b is the maximum integer in $Pow_{t-1}^T$
and a is the minimum integer in $Pow_{t-1}^T$.

Proof. Since we assign signatures to signature blocks and run-length signatures
in the derivation tree of S in the order they appear in the signature encoding,
$Pow_{t-1}^T[i] - a$ fits in an entry of a bucket of size b - a for each element $Pow_{t-1}^T[i]$
of $Pow_{t-1}^T$. Also, the length of each block is at most four. Hence we can sort all
the blocks of $Eblock(Pow_{t-1}^T)$ by bucket sort in $O((b - a) + |Pow_{t-1}^T|)$ time and
space. Since Sig is an injection and since we process the levels in increasing order,
for any two different levels $0 < t' < t \le h$, no elements of $Shrink_{t-1}^T$ appear in
$Shrink_{t'-1}^T$, and hence no elements of $Pow_{t-1}^T$ appear in $Pow_{t'-1}^T$. Thus, we can
determine a new signature for each block in $Eblock(Pow_{t-1}^T)$, without searching
existing signatures in the lower levels. This completes the proof. □

Lemma 16. Given $Epow(Shrink_t^T)$, we can compute $Pow_t^T$ in
$O(x + (b - a) + |Epow(Shrink_t^T)|)$ time and space, where x is the maximum length of runs in
$Epow(Shrink_t^T)$, b is the maximum integer in $Pow_{t-1}^T$, and a is the minimum
integer in $Pow_{t-1}^T$.

Proof. We first sort all the elements of $Epow(Shrink_t^T)$ by bucket sort in
$O((b - a) + |Epow(Shrink_t^T)|)$ time and space, ignoring the powers of runs. Then, for each
integer r appearing in $Shrink_t^T$, we sort the runs of r's by bucket sort with a
bucket of size x. This takes a total of $O(x + |Epow(Shrink_t^T)|)$ time and space for
all integers appearing in $Shrink_t^T$. The rest is the same as the proof of Lemma 15. □

The next lemma shows how to construct $\mathcal{H}(\log w, w)$ from a sorted assignment
set $\mathcal{D}$ of $\mathcal{G}$.

Lemma 17. Given a sorted assignment set $\mathcal{D}$ of $\mathcal{G}$, we can construct $\mathcal{H}(\log w, w)$
of $\mathcal{G}$ in $O(|\mathcal{D}|)$ time.

Proof. Recall that $\mathcal{H}$ consists of $\mathcal{A}$ and DAG $\mathcal{B}$. Clearly, given a sorted assignment
set $\mathcal{D}$, we can construct $\mathcal{B}$ in linear time and space. Also, we can construct,
in linear time and space, a balanced search tree for $\mathcal{A}$ from $\mathcal{D}$. Hence Lemma 17
holds. □

We are ready to prove the theorem. 




Proof. In the derivation tree of id(T), the number of nodes in some level
is halved when going up two levels higher, so the size of the
derivation tree of id(T) is O(N). Hence, by Lemmas 1, 15 and 16, we can compute
id(T) and a sorted assignment set $\mathcal{D}$ of $\mathcal{G}$ in O(N) time and space. Finally, by
Lemma 17, we can get $\mathcal{H}(\log w, w)$ for $\mathcal{G}$ in O(N) time. □


$\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ construction in $O(N f_{\mathcal{A}})$ time and $O(f'_{\mathcal{A}} + w)$ working space.

Proof. Note that we can naively compute id(T) for a given string T in $O(N f_{\mathcal{A}})$
time and O(N) working space. In order to reduce the working space, we consider
factorizing T into blocks of size B and processing them incrementally: Starting
with the empty signature encoding $\mathcal{G}$, we can compute id(T) in
$O(\frac{N}{B} f_{\mathcal{A}}(\log N \log^* M + B))$ time and $O(w + B + f'_{\mathcal{A}})$ working space by using
$INSERT(T[(i - 1)B + 1..iB], (i - 1)B + 1)$ for $i = 1, \ldots, \frac{N}{B}$ in increasing order.
Hence our proof is finished by choosing $B = \log N \log^* M$. □

A.2 Proof of Theorem 2 (2)

Proof. Consider $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ for an empty signature encoding $\mathcal{G}$. If we can compute
the INSERT(Y, i) operation in $O(f_{\mathcal{A}} \log N \log^* M)$ time, then Theorem 2 (2)
immediately holds by computing $INSERT(f_i, |f_1 \cdots f_{i-1}| + 1)$ for $1 \le i \le z$
incrementally. By the proof of Lemma 9, we can compute INSERT(Y, i) for
a given Uniq(Y) in $O(f_{\mathcal{A}} \log N \log^* M)$ time. We can compute each $Uniq(f_i)$
in $O(\log N \log^* M)$ time by Lemma 7 because $f_i$ occurs previously in T when
$|f_i| > 1$. Hence we get Theorem 2 (2). □

Note that we can directly show LemmaOfrom the above proof because the size of 
$\mathcal{G}$ increases by $O(\log N \log^* M)$ by Lemma 6 every time we do
$INSERT(f_i, |f_1 \cdots f_{i-1}| + 1)$ for $1 \le i \le z$.

A.3 Proof of Theorem 2 (3)

$\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ construction in $O(n f_{\mathcal{A}} \log N \log^* M)$ time and $O(f'_{\mathcal{A}} + w)$
working space.

Proof. We can construct $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ by O(n) INSERT operations as in the proof of
Theorem 2 (2). □


$\mathcal{H}(\log w, w)$ construction in $O(n \log \log n \log N \log^* M)$ time and
$O(n \log^* M + w)$ working space. In this section, we sometimes abbreviate val(X) as X for
$X \in \mathcal{S}$. For example, $Shrink_t^X$ and $Pow_t^X$ represent $Shrink_t^{val(X)}$ and $Pow_t^{val(X)}$,
respectively.

Our algorithm computes signatures level by level, i.e., constructs incrementally
$Shrink_0^{X_n}, Pow_0^{X_n}, \ldots, Shrink_h^{X_n}, Pow_h^{X_n}$. Like the algorithm described in
Section A.1, we can create signatures by sorting blocks of signatures or
run-length encoded signatures in the same level. The main difference is that we now
utilize the structure of the SLP, which allows us to do the task efficiently in
$O(n \log^* M + w)$ working space. In particular, although $|Shrink_t^{X_n}|, |Pow_t^{X_n}| =
O(N)$ for $0 \le t \le h$, they can be represented in $O(n \log^* M)$ space.

In so doing, we introduce some additional notations relating to XShrink^ 
and XPow^ in Definition [3] By Lemma lU for any string P = Pi P 2 the following 
equation holds: 

XShrinkf = for 0 < t 

XPoWt = for 0 < t < 

where we define yf and yf for a string P as follows: 

f XShrinkf for 0 < t < , 

I e for t > h^, 

P J XPowf for 0 < t < h^, 

|e for t > . 

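For any variable $X_i \to X_\ell X_r$, we denote $\hat{z}_t^{X_i} = \hat{z}_t^{(val(X_\ell), val(X_r))}$ for
$0 < t \le h^{val(X_i)}$ and $z_t^{X_i} = z_t^{(val(X_\ell), val(X_r))}$ for $0 \le t < h^{val(X_i)}$. Note that
$|\hat{z}_t^{X_i}|, |z_t^{X_i}| = O(\log^* M)$ by Lemma ??. We can use $\hat{z}_t^{X_1}, \ldots, \hat{z}_t^{X_n}$ (resp. $z_t^{X_1}, \ldots, z_t^{X_n}$)
as a compressed representation of $XShrink_t^{X_n}$ (resp. $XPow_t^{X_n}$) based on the SLP:
intuitively, $\hat{z}_t^{X_n}$ (resp. $z_t^{X_n}$) covers the middle part of $XShrink_t^{X_n}$ (resp. $XPow_t^{X_n}$),
and the remaining part is recovered by investigating the left/right child
recursively (see also Figure ??). Hence, with the DAG structure of the SLP, $XShrink_t^{X_n}$
and $XPow_t^{X_n}$ can be represented in $O(n \log^* M)$ space.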

In addition, we define $\hat{A}_t^P$, $\hat{B}_t^P$, $A_t^P$ and $B_t^P$ as follows: for $0 < t \le h^P$, $\hat{A}_t^P$
(resp. $\hat{B}_t^P$) is a prefix (resp. suffix) of $Shrink_t^P$ which consists of the signatures of
$A_{t-1}^P L_{t-1}^P$ (resp. $R_{t-1}^P B_{t-1}^P$); and for $0 \le t < h^P$, $A_t^P$ (resp. $B_t^P$) is a prefix (resp.
suffix) of $Pow_t^P$ which consists of the signatures of $\hat{A}_t^P \hat{L}_t^P$ (resp. $\hat{R}_t^P \hat{B}_t^P$). By
definition, $Shrink_t^P = \hat{A}_t^P \, XShrink_t^P \, \hat{B}_t^P$ for $0 < t \le h^P$, and $Pow_t^P = A_t^P \, XPow_t^P \, B_t^P$
for $0 \le t < h^P$. See Figure 5 for an illustration.

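Since $Shrink_t^{X_n} = \hat{A}_t^{X_n} \, XShrink_t^{X_n} \, \hat{B}_t^{X_n}$ for $0 < t \le h^{X_n}$, we use
$\hat{\Lambda}_t = (\hat{z}_t^{X_1}, \ldots, \hat{z}_t^{X_n}, \hat{A}_t^{X_n}, \hat{B}_t^{X_n})$ as a compressed representation of $Shrink_t^{X_n}$ of size
$O(n \log^* M)$. Similarly, for $0 \le t < h^{X_n}$, we use $\Lambda_t = (z_t^{X_1}, \ldots, z_t^{X_n}, A_t^{X_n}, B_t^{X_n})$
as a compressed representation of $Pow_t^{X_n}$ of size $O(n \log^* M)$.

Our algorithm incrementally computes $\Lambda_0, \hat{\Lambda}_1, \ldots, \hat{\Lambda}_{h^{X_n}}$. Note that, given
$\hat{\Lambda}_{h^{X_n}}$, we can easily get $Pow_{h^{X_n}}^{X_n}$, which has size $O(\log^* M)$, in $O(n \log^* M)$ time, and
then $id(val(X_n))$ in $O(\log^* M)$ time from $Pow_{h^{X_n}}^{X_n}$. Hence, in the following three
lemmas, we show how to compute $\Lambda_0, \hat{\Lambda}_1, \ldots, \hat{\Lambda}_{h^{X_n}}$.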

Lemma 18. Given an SLP of size $n$, we can compute $\Lambda_0$ in $O(n \log\log n \log^* M)$
time and $O(n \log^* M)$ space.

Proof. We first compute, for all variables $X_i$, $Epow(XShrink_0^{X_i})$ if $|Epow(XShrink_0^{X_i})| \le
\Delta_L + \Delta_R + 9$, otherwise $Epow(\hat{L}_0^{X_i})$ and $Epow(\hat{R}_0^{X_i})$. This information can be
computed in $O(n \log^* M)$ time and space in a bottom-up manner, i.e., by processing
variables in increasing order. For $X_i \to X_\ell X_r$, if both $|Epow(XShrink_0^{X_\ell})|$
and $|Epow(XShrink_0^{X_r})|$ are no greater than $\Delta_L + \Delta_R + 9$, we can compute
$Epow(XShrink_0^{X_i})$ in $O(\log^* M)$ time by naively concatenating $Epow(XShrink_0^{X_\ell})$
and $Epow(XShrink_0^{X_r})$. Otherwise $|Epow(XShrink_0^{X_i})| > \Delta_L + \Delta_R + 9$ must hold,
and $Epow(\hat{L}_0^{X_i})$ and $Epow(\hat{R}_0^{X_i})$ can be computed in $O(1)$ time from the
information for $X_\ell$ and $X_r$.

The run-length encoded signatures represented by $z_0^{X_i}$ can be obtained by
using the above information for $X_\ell$ and $X_r$ in $O(\log^* M)$ time: $z_0^{X_i}$ is created over
the run-length encoded signatures $Epow(XShrink_0^{X_\ell})$ (or $Epow(\hat{R}_0^{X_\ell})$) followed by
$Epow(XShrink_0^{X_r})$ (or $Epow(\hat{L}_0^{X_r})$). Also, by definition, $A_0^{X_n}$ and $B_0^{X_n}$ represent
$Epow(\hat{L}_0^{X_n})$ and $Epow(\hat{R}_0^{X_n})$, respectively.

Hence, we can compute in $O(n \log^* M)$ time the $O(n \log^* M)$ run-length encoded
signatures to which we give signatures. We determine signatures by sorting the
run-length encoded signatures as in Lemma ??. However, in contrast to Lemma ??,
we do not use bucket sort for sorting the powers of runs because the maximum
length of runs could be as large as $N$ and we cannot afford $O(N)$ space
for buckets. Instead, we use the sorting algorithm of Han [12], which sorts $x$
integers in $O(x \log\log x)$ time and $O(x)$ space. Hence, we can compute $\Lambda_0$ in
$O(n \log\log n \log^* M)$ time and $O(n \log^* M)$ space.
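
As an informal illustration of the two primitives used in this proof, the following Python sketch run-length encodes a sequence and then names the resulting runs by sorting them. It is only a sketch under assumptions: the names `epow` and `assign_signatures` are ours, Python's comparison sort stands in for the integer sorting algorithm of Han [12], and the SLP-based compressed representation is ignored entirely.

```python
from itertools import groupby


def epow(seq):
    """Run-length encode a sequence: 'aab' -> [('a', 2), ('b', 1)]."""
    return [(sym, len(list(run))) for sym, run in groupby(seq)]


def assign_signatures(runs, first_id=0):
    """Give equal (symbol, power) runs equal ids by sorting the distinct runs.

    A comparison sort is used here for brevity (an assumption of this
    sketch); the proof instead sorts integers with Han's algorithm so that
    no O(N)-size bucket array is needed for the powers.
    """
    distinct = sorted(set(runs))
    table = {run: first_id + rank for rank, run in enumerate(distinct)}
    return [table[run] for run in runs], table


if __name__ == "__main__":
    runs = epow("aaabccaa")
    ids, _ = assign_signatures(runs)
    print(runs, ids)
```

Equal runs always receive the same id, which is the only property the level-by-level construction relies on.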

Lemma 19. Given $\hat{\Lambda}_t$, we can compute $\Lambda_t$ in $O(n \log\log n \log^* M)$ time and
$O(n \log^* M)$ space.

Proof. The computation process is similar to that of Lemma 18, except that we
also use the information in $\hat{\Lambda}_t$.

Fig. 5. An abstract image of $Shrink_t^P$ and $Pow_t^P$ for a string $P$. For $0 \le t < h^P$, $A_t^P L_t^P$
(resp. $R_t^P B_t^P$) is encoded into $\hat{A}_{t+1}^P$ (resp. $\hat{B}_{t+1}^P$). Similarly, for $0 < t < h^P$, $\hat{A}_t^P \hat{L}_t^P$ (resp.
$\hat{R}_t^P \hat{B}_t^P$) is encoded into $A_t^P$ (resp. $B_t^P$).


We first compute, for all variables $X_i$, $Epow(XShrink_t^{X_i})$ if $|Epow(XShrink_t^{X_i})| \le
\Delta_L + \Delta_R + 9$, otherwise $Epow(\hat{L}_t^{X_i})$ and $Epow(\hat{R}_t^{X_i})$. This information can be
computed in $O(n \log^* M)$ time and space in a bottom-up manner, i.e., by processing
variables in increasing order. For $X_i \to X_\ell X_r$, if both $|Epow(XShrink_t^{X_\ell})|$
and $|Epow(XShrink_t^{X_r})|$ are no greater than $\Delta_L + \Delta_R + 9$, we can compute
$Epow(XShrink_t^{X_i})$ in $O(\log^* M)$ time by naively concatenating $Epow(XShrink_t^{X_\ell})$,
$Epow(\hat{z}_t^{X_i})$ and $Epow(XShrink_t^{X_r})$. Otherwise $|Epow(XShrink_t^{X_i})| > \Delta_L + \Delta_R + 9$
must hold, and $Epow(\hat{L}_t^{X_i})$ and $Epow(\hat{R}_t^{X_i})$ can be computed in $O(1)$ time from
$Epow(\hat{z}_t^{X_i})$ and the information for $X_\ell$ and $X_r$.

The run-length encoded signatures represented by $z_t^{X_i}$ can be obtained in
$O(\log^* M)$ time by using $\hat{z}_t^{X_i}$ and the above information for $X_\ell$ and $X_r$: $z_t^{X_i}$ is
created over the run-length encoded signatures that are obtained by concatenating
$Epow(XShrink_t^{X_\ell})$ (or $Epow(\hat{R}_t^{X_\ell})$), $\hat{z}_t^{X_i}$ and $Epow(XShrink_t^{X_r})$ (or $Epow(\hat{L}_t^{X_r})$).
Also, $A_t^{X_n}$ and $B_t^{X_n}$ represent $\hat{A}_t^{X_n} \hat{L}_t^{X_n}$ and $\hat{R}_t^{X_n} \hat{B}_t^{X_n}$, respectively.

Hence, we can compute in $O(n \log^* M)$ time the $O(n \log^* M)$ run-length encoded
signatures to which we give signatures. We determine signatures in $O(n \log\log n \log^* M)$
time by sorting the run-length encoded signatures as in Lemma 18.

Lemma 20. Given $\Lambda_t$, we can compute $\hat{\Lambda}_{t+1}$ in $O(n \log^* M)$ time and $O(n \log^* M)$
space.

Proof. In order to compute $\hat{z}_{t+1}^{X_i}$ for a variable $X_i \to X_\ell X_r$, we need the signature
sequence on which $\hat{z}_{t+1}^{X_i}$ is created, as well as its context, i.e., $\Delta_L$ signatures to
the left and $\Delta_R$ to the right. To be precise, the needed signature sequence is
$v_t^{X_\ell} z_t^{X_i} u_t^{X_r}$, where $u_t^{X_j}$ (resp. $v_t^{X_j}$) denotes a prefix (resp. suffix) of $y_t^{X_j}$ of length
$\Delta_L + \Delta_R + 4$ for any variable $X_j$ (see also Figure 6). Also, we need $A_t^{X_n} u_t^{X_n}$ and
$v_t^{X_n} B_t^{X_n}$ to create $\hat{A}_{t+1}^{X_n}$ and $\hat{B}_{t+1}^{X_n}$, respectively.

Note that by Definition 3, $|z_t^{X_i}| > \Delta_L + \Delta_R + 9$ if $z_t^{X_i} \neq \varepsilon$. Then, we can
compute $u_t^{X_i}$ for all variables $X_i$ in $O(n \log^* M)$ time and space by processing
variables in increasing order on the basis of the following fact: $u_t^{X_i} = u_t^{X_\ell}$ if
$y_t^{X_\ell} \neq \varepsilon$, otherwise $u_t^{X_i}$ is the prefix of $z_t^{X_i}$ of length $\Delta_L + \Delta_R + 4$. Similarly, $v_t^{X_i}$
for all variables $X_i$ can be computed in $O(n \log^* M)$ time and space.

Fig. 6. Abstract images of the needed signature sequence $v_t^{X_\ell} z_t^{X_i} u_t^{X_r}$ ($v_t^{X_\ell}$ and $u_t^{X_r}$
are not shown when they are empty) for computing $\hat{z}_{t+1}^{X_i}$ in three situations (top,
middle and bottom), distinguished by the range of $t$ below $h^{X_i}$.

Using $u_t^{X_i}$ and $v_t^{X_i}$ for all variables $X_i$, we can obtain the $O(n \log^* M)$ blocks of
signatures to which we give signatures. We determine signatures by sorting the
blocks by bucket sort, as in Lemma ??, in $O(n \log^* M)$ time.

Hence, we can compute $\hat{\Lambda}_{t+1}$ in $O(n \log^* M)$ time and $O(n \log^* M)$ space.

We are ready to prove the theorem.

Proof. Using Lemmas 18, 19 and 20, we can get $\hat{\Lambda}_{h^{X_n}}$ in $O(n \log\log n \log N \log^* M)$
time by computing $\Lambda_0, \hat{\Lambda}_1, \ldots, \hat{\Lambda}_{h^{X_n}}$ incrementally. Note that during the
computation we only have to keep $\Lambda_t$ (or $\hat{\Lambda}_t$) for the current $t$ and the assignments of $\mathcal{G}$.
Hence the working space is $O(n \log^* M + w)$. With $O(n \log^* M)$-time processing,
we can get a sorted assignment set $\mathcal{D}$ of $\mathcal{G}$ of size $O(w)$. Finally, we process
$\mathcal{G}$ in $O(w)$ time and space to get $\mathcal{H}(\log w, w)$ by Lemma ??.

B Appendix: Applications 

B.1 Proof of Theorem 3

For integers $j, k$ with $1 \le j \le j + k - 1 \le N$, let $Fst(j, k)$ be the function which
returns the minimum integer $i$ such that $i < j$ and $T[i..i+k-1] = T[j..j+k-1]$,
if it exists. Our algorithm is based on the following fact:

Fact 2. Let $f_1, \ldots, f_z$ be the LZ77 factorization of a string $T$. Given $f_1, \ldots, f_{i-1}$,
we can compute $f_i$ with $O(\log |f_i|)$ calls of $Fst(j, k)$ (by doubling the value of $k$,
followed by a binary search), where $j = |f_1 \cdots f_{i-1}| + 1$.
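
The doubling and binary search of Fact 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the data structure of this paper: `fst` is a naive stand-in built on `str.find`, positions are 0-based, the function names are ours, and an earlier occurrence is allowed to overlap position $j$, exactly as in the definition of $Fst$ above (a variant without self-references would add one more check).

```python
def fst(T, j, k):
    """Leftmost i < j with T[i:i+k] == T[j:j+k], or None (naive stand-in)."""
    i = T.find(T[j:j + k])          # leftmost occurrence; it is at most j
    return i if i < j else None


def next_factor_length(T, j):
    """Length of the factor starting at j, using O(log |factor|) fst calls."""
    n = len(T)
    if fst(T, j, 1) is None:
        return 1                    # a character that has not occurred before
    # Doubling phase: lo always has an earlier occurrence.
    lo = 1
    while j + 2 * lo <= n and fst(T, j, 2 * lo) is not None:
        lo *= 2
    # hi is a length known to fail (or one past the end of T).
    hi = min(2 * lo, n - j + 1)
    # Binary search phase on the interval (lo, hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fst(T, j, mid) is not None:
            lo = mid
        else:
            hi = mid
    return lo


def lz77(T):
    """Factorize T left to right with next_factor_length."""
    factors, j = [], 0
    while j < len(T):
        k = next_factor_length(T, j)
        factors.append(T[j:j + k])
        j += k
    return factors
```

The search is valid because having an earlier occurrence is monotone in $k$: if $T[j..j+k-1]$ occurs before $j$, then so does every shorter prefix of it.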

We explain how to support queries $Fst(j, k)$ using the signature encoding.
We define $e.min = \min vOcc(e, S) + |e.left|$ for a signature $e \in \mathcal{V}$ with $e \to e_\ell e_r$
or $e \to \hat{e}^k$. We also define $FstOcc(P, i)$ for a string $P$ and an integer $i$ as follows:

$$FstOcc(P, i) = \min\{ e.min \mid (e, i) \in pOcc_{\mathcal{G}}(P, i) \}.$$

Then $Fst(j, k)$ can be represented by $FstOcc(P, i)$ as follows:

$$Fst(j, k) = \min\{ FstOcc(T[j..j+k-1], i) - i \mid i \in \{1, \ldots, k-1\} \}$$
$$\phantom{Fst(j, k)} = \min\{ FstOcc(T[j..j+k-1], i) - i \mid i \in \mathcal{P} \},$$

where $\mathcal{P}$ is the set of integers in Lemma ?? with $P = T[j..j+k-1]$.

Recall that in Section 4.3 we considered the two-dimensional orthogonal
range reporting problem to enumerate $pOcc_{\mathcal{G}}(P, i)$. Note that $FstOcc(P, i)$ can
be obtained by taking $(e, i) \in pOcc_{\mathcal{G}}(P, i)$ with $e.min$ minimum. In order to
compute $FstOcc(P, i)$ efficiently instead of enumerating all elements in $pOcc_{\mathcal{G}}(P, i)$,
we give every point corresponding to $e$ the weight $e.min$ and use the following data
structure to compute a point with the minimum weight in a given rectangle.

Lemma 21 ([??]). Consider $n$ weighted points on a two-dimensional plane. There
exists a data structure which supports the query to return a point with the
minimum weight in a given rectangle in $O(\log^2 n)$ time, occupies $O(n)$ space, and
requires $O(n \log n)$ time to construct.

Using Lemma 21, we get the following lemma.

Lemma 22. Given a signature encoding $\mathcal{G} = (\Sigma, \mathcal{V}, \mathcal{D}, S)$ of size $w$ which
generates $T$, we can construct a data structure of $O(w)$ space in $O(w \log w \log N \log^* N)$
time to support queries $Fst(j, k)$ in $O(\log w \log k \log^* N (\log N + \log k \log^* N))$
time.

Proof. For construction, we first compute $e.min$ in $O(w)$ time using the DAG
of $\mathcal{G}$. Next, we prepare the plane defined by the two ordered sets $\mathcal{X}$ and $\mathcal{Y}$ in
Section 4.3. This can be done in $O(w \log w \log N \log^* N)$ time by sorting the elements
in $\mathcal{X}$ (and $\mathcal{Y}$) by the LCE algorithm (Lemma 5) and a standard comparison-based
sorting. Finally we build the data structure of Lemma 21 in $O(w \log w)$ time.

To support a query $Fst(j, k)$, we first compute $Epow(Uniq(P))$ with $P =
T[j..j+k-1]$ in $O(\log N + \log k \log^* N)$ time by Lemma 7, and then get $\mathcal{P}$ in
Lemma ??. Since $|\mathcal{P}| = O(\log k \log^* M)$ by Lemma ??, $Fst(j, k) = \min\{ FstOcc(P, i) -
i \mid i \in \mathcal{P} \}$ can be computed by answering $FstOcc$ $O(\log k \log^* M)$ times. For each
computation of $FstOcc(P, i)$, we spend $O(\log w (\log N + \log k \log^* N))$ time to
compute $x_1^{(P,i)}$, $x_2^{(P,i)}$, $y_1^{(P,i)}$ and $y_2^{(P,i)}$ by Lemma ??, and $O(\log^2 w)$ time to
compute a point with the minimum weight in the rectangle $(x_1^{(P,i)}, x_2^{(P,i)}, y_1^{(P,i)}, y_2^{(P,i)})$.
Hence it takes $O(\log k \log^* M (\log w (\log N + \log k \log^* N) + \log^2 w)) =
O(\log w \log k \log^* N (\log N + \log k \log^* N))$ time in total.

We are ready to prove Theorem 3.

Proof (of Theorem 3). We first compute the signature encoding of $T$ in
$O(|T| f_{\mathcal{A}})$ time and $O(f'_{\mathcal{A}} + w)$ working space by the algorithm of Theorem 2 (1).
Using a data structure $\mathcal{H}(f_{\mathcal{A}}, f'_{\mathcal{A}})$ achieving
$f_{\mathcal{A}} = O\left(\min\left\{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \right\}\right)$
time and $f'_{\mathcal{A}} = O(w)$ space, the working space becomes $O(w)$. Next we
compute the $z$ factors of the LZ77 factorization of $T$ incrementally by using
Fact 2 and Lemma 22 in $O(z \log w \log^3 N (\log^* N)^2)$ time. Therefore the statement
holds.


B.2 Proof of Theorem 4

Proof of Theorem 4 (1)

Proof. For any signature $e \in \mathcal{V}$ such that $e \to e_\ell e_r$, we can easily translate $e$
to a production of SLPs because the assignment is a pair of signatures, like the
right-hand side of the production rules of SLPs. For any signature $e \in \mathcal{V}$ such
that $e \to \hat{e}^k$, we can translate $e$ to at most $2 \log k$ production rules of SLPs: we
create $t = \lfloor \log k \rfloor$ variables which represent $\hat{e}^{2^1}, \hat{e}^{2^2}, \ldots, \hat{e}^{2^t}$ and concatenate
them according to the binary representation of $k$ to make up $k$ copies of $\hat{e}$. Therefore we
can compute $\mathcal{S}$ in $O(w \log |T|)$ time.
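
The translation of a run rule $e \to \hat{e}^k$ into binary productions can be sketched as follows. The function names and the rule representation (triples `(lhs, left, right)`) are assumptions made for this illustration only.

```python
def power_rules(base, k, fresh):
    """Return (root, rules) such that root derives `base` repeated k times.

    `fresh` is a callable returning an unused variable name on each call.
    rules is a list of binary productions (lhs, left, right).
    """
    assert k >= 1
    rules = []
    powers = [base]                       # powers[i] derives base^(2^i)
    while (1 << len(powers)) <= k:        # floor(log2 k) doubling rules
        v = fresh()
        rules.append((v, powers[-1], powers[-1]))
        powers.append(v)
    root = None
    for i in reversed(range(len(powers))):
        if (k >> i) & 1:                  # follow the binary representation of k
            if root is None:
                root = powers[i]
            else:
                v = fresh()
                rules.append((v, root, powers[i]))
                root = v
    return root, rules


def expand(sym, rules):
    """Expand a symbol to the string it derives (for checking only)."""
    table = {lhs: (left, right) for lhs, left, right in rules}
    if sym not in table:
        return sym
    left, right = table[sym]
    return expand(left, rules) + expand(right, rules)
```

The doubling phase adds $\lfloor \log k \rfloor$ rules and the concatenation phase adds one rule per additional set bit of $k$, so at most $2 \lfloor \log k \rfloor$ rules are created in total.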


Proof of Theorem 4 (2)

Proof. Note that the number of created or removed signatures in $\mathcal{V}$ is bounded
by $O(y + \log |T'| \log^* M)$ by Lemma ??. For each of the removed signatures, we
remove the corresponding production from $\mathcal{S}$. For each of the created signatures,
we create the corresponding production and add it to $\mathcal{S}$ as in the proof of (1).
Therefore Theorem 4 (2) holds. □

B.3 Proof of Theorem 5

We use the following known result.

Lemma 23 ([2]). Using the DAG for a signature encoding $\mathcal{G} = (\Sigma, \mathcal{V}, \mathcal{D}, S)$,
we can support

- $LCP(s_1, s_2)$ in $O(\log |s_1| + \log |s_2|)$ time,

- $LCS(s_1, s_2)$ in $O((\log |s_1| + \log |s_2|) \log^* M)$ time,

where $id(s_1), id(s_2) \in \mathcal{V}$.

Proof. We compute $LCP(s_1, s_2)$ by $LCE(s_1, s_2, 1, 1)$, namely, we use the
algorithm of Lemma ??. Let $P$ denote the longest common prefix of $s_1$ and $s_2$. We use
the notation defined in Section A.3. There exists a signature sequence $v =
\hat{A}_{h^P}^P \, XShrink_{h^P}^P \, R_{h^P-1}^P \hat{R}_{h^P-1}^P \cdots R_0^P \hat{R}_0^P$ that occurs at position 1 in $id(s_1)$ and
$id(s_2)$ by an argument similar to that of Lemma ??. Since $|Epow(v)| = O(\log |P| + \log^* M)$,
we can compute $LCP(s_1, s_2)$ in $O(\log |s_1| + \log |s_2|)$ time. Similarly, we can
compute $LCS(s_1, s_2)$ in $O((\log |s_1| + \log |s_2|) \log^* M)$ time. More detailed proofs can
be found in [2]. □

To use Lemma 23 for $id(val(X_1)), \ldots, id(val(X_n))$, we show the following
lemma.

Lemma 24. Given an SLP $\mathcal{S}$, we can compute $id(val(X_1)), \ldots, id(val(X_n))$ in
$O(n \log\log n \log N \log^* N)$ time and $O(n \log N \log^* N)$ space.

Proof. Recall that the algorithm of Theorem 2 (3) computes $id(val(X_n))$ in
$O(n \log\log n \log N \log^* N)$ time. We can modify the algorithm to compute
$id(val(X_1)), \ldots, id(val(X_n))$ without changing the time complexity: we just
compute $\hat{A}_t^X$, $\hat{B}_t^X$, $A_t^X$ and $B_t^X$ for all $X \in \mathcal{S}$, not only for $X_n$. Since the
total size is $O(n \log N \log^* N)$, Lemma 24 holds.

We are ready to prove Theorem 5.

Proof. The first result immediately follows from Lemmas 23 and 24. To speed
up query times for LCP and LCS, we sort the variables in lexicographical order in
$O(n \log n \log N)$ time by LCP queries and a standard comparison-based sorting.
Constant-time LCP queries are then possible by using a constant-time RMQ data
structure [33] on the sequence of the lcp values. LCS queries can be supported
similarly. □
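
The reduction from LCP queries to range minimum queries can be illustrated with explicit strings. The sketch below is an assumption-laden stand-in: it sorts plain Python strings instead of SLP variables, computes adjacent lcp values character by character, and uses a sparse table ($O(n \log n)$ space) in place of the RMQ structure of [33]. The class and method names are ours.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of two strings (naive)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class LcpIndex:
    def __init__(self, strings):
        order = sorted(range(len(strings)), key=lambda i: strings[i])
        self.rank = [0] * len(strings)
        for r, i in enumerate(order):
            self.rank[i] = r
        self.lengths = [len(s) for s in strings]
        # adj[r] = lcp of the strings of rank r and r + 1.
        adj = [lcp_len(strings[order[r]], strings[order[r + 1]])
               for r in range(len(order) - 1)]
        # Sparse table: table[L][i] = min(adj[i : i + 2**L]).
        self.table = [adj]
        span = 1
        while 2 * span <= len(adj):
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + span])
                               for i in range(len(adj) - 2 * span + 1)])
            span *= 2

    def lcp(self, i, j):
        """LCP length of strings i and j via one range-minimum lookup."""
        if i == j:
            return self.lengths[i]
        lo, hi = sorted((self.rank[i], self.rank[j]))
        level = (hi - lo).bit_length() - 1
        row = self.table[level]
        return min(row[lo], row[hi - (1 << level)])
```

The query is correct because the LCP of two strings equals the minimum of the adjacent lcp values between their ranks in lexicographical order.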


B.4 Proof of Theorem 6

Proof. We can compute $\mathcal{H}(\log w, w)$ for a signature encoding $\mathcal{G} = (\Sigma, \mathcal{V}, \mathcal{D}, S)$
of size $w$ representing $T$ in $O(n \log\log n \log N \log^* N)$ time and $O(n \log^* M + w)$
working space using Theorem 2, where $w = O(z \log N \log^* N)$. Notice that each
variable of the SLP appears at least once in the derivation tree $\mathcal{T}_n$ of the last
variable $X_n$ representing the string $T$. Hence, if we store an occurrence of each
variable $X_i$ in $\mathcal{T}_n$ and $|val(X_i)|$, we can reduce any LCE query on two variables
to an LCE query on two positions of $val(X_n) = T$. In so doing, for all $1 \le i \le n$
we first compute $|val(X_i)|$ and then compute the leftmost occurrence $\ell_i$ of $X_i$ in
$\mathcal{T}_n$, spending $O(n)$ total time and space. By Lemma ??, each LCE query can be
supported in $O(\log N + \log \ell \log^* N)$ time. Since $z \le n$ [34], the total preprocessing
time is $O(n \log\log n \log N \log^* N)$ and the working space is $O(n \log^* M + w)$. □
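
The $O(n)$-time preprocessing in this proof (lengths and leftmost occurrences) and the resulting reduction can be sketched as follows. The SLP representation (a list whose entry is either a character or a pair of smaller indices, with the last entry as root), the 0-based positions, and the naive character-by-character `lce_text` are all assumptions of this illustration; in the paper the text LCE is answered by the signature encoding.

```python
def lengths_and_leftmost(rules):
    """rules[i] is a 1-character string or a pair (l, r) with l, r < i."""
    n = len(rules)
    length = [0] * n
    for i, rule in enumerate(rules):      # bottom-up: |val(X_i)|
        length[i] = 1 if isinstance(rule, str) else length[rule[0]] + length[rule[1]]
    leftmost = [None] * n
    leftmost[n - 1] = 0                   # the root derives the whole text
    for i in range(n - 1, -1, -1):        # top-down: parents before children
        if leftmost[i] is None or isinstance(rules[i], str):
            continue
        l, r = rules[i]
        for child, pos in ((l, leftmost[i]), (r, leftmost[i] + length[l])):
            if leftmost[child] is None or pos < leftmost[child]:
                leftmost[child] = pos
    return length, leftmost


def lce_text(T, p, q):
    """Naive LCE on text positions (stand-in for the compressed LCE query)."""
    k = 0
    while p + k < len(T) and q + k < len(T) and T[p + k] == T[q + k]:
        k += 1
    return k


def lce_variables(T, length, leftmost, i, j, a=0, b=0):
    """LCE of val(X_i)[a:] and val(X_j)[b:], reduced to an LCE on T."""
    limit = min(length[i] - a, length[j] - b)
    return min(limit, lce_text(T, leftmost[i] + a, leftmost[j] + b))
```

Processing variables in decreasing index order guarantees that a variable's leftmost occurrence is final before it is propagated to its children, since every parent has a larger index.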

B.5 Proof of Theorem 7

Proof. For a given SLP of size $n$ representing a string of length $N$, let $P(n, N)$,
$S(n, N)$, and $E(n, N)$ be the preprocessing time and space requirement for an
LCE data structure on SLP variables, and each LCE query time, respectively.

Matsubara et al. [??] showed that we can compute an $O(n \log N)$-size
representation of all palindromes in the string in $O(P(n, N) + E(n, N) \cdot n \log N)$
time and $O(n \log N + S(n, N))$ space. Hence, using Theorem 6, we can find
all palindromes in the string in $O(n \log\log n \log N \log^* N + n \log^2 N \log^* N) =
O(n \log^2 N \log^* N)$ time and $O(n \log^* N + w)$ space. □

B.6 Proof of Theorem 8

Proof. It is shown in [??] that we can compute the Lyndon factorization of the
string in $O(P(n, N) + E(n, N) \cdot n \log n)$ time using $O(n^2 + S(n, N))$ space. Hence,
using Theorem 5, we can compute the Lyndon factorization of the string in
$O(n \log\log n \log N \log^* N + n \log n \log N \log^* N) = O(n \log n \log N \log^* N)$ time.

We remark that since $m \le n$ due to [??], the output size $m$ is omitted in the
total time complexity. □

B.7 Proof of Theorem 9

Proof. In the preprocessing phase, we construct an $\mathcal{H}(\log w', w')$ for a signature
encoding $\mathcal{G} = (\Sigma, \mathcal{V}, \mathcal{D}, S)$ of size $w'$ such that
$id(val(X_1)), \ldots, id(val(X_n)) \in \mathcal{V}$ using Lemma 24, spending $O(n \log\log n \log N \log^* M)$
time, where $w' = O(n \log N \log^* M)$. Next we construct a compacted trie of size
0(m) that represents the m patterns, which we denote by PTree (pattern tree). 
Formally, each non-root node of PTree represents either a pattern or the longest 
common prefix of some pair of patterns. PTree can be built by using LCP of
Theorem 5 in $O(m \log m \log N)$ time. We let each node have its string depth, and
the pointer to its deepest ancestor node that represents a pattern if such exists. 
Further, we augment PTree with a data structure for level ancestor queries so 
that we can locate any prefix of any pattern, designated by a pattern and length, 
in PTree in O(logm) time by locating the string depth by binary search on the 
path from the root to the node representing the pattern. Supposing that we know 
the longest prefix of T[i..|T|] that is also a prefix of one of the patterns, which 
we call the max-prefix for $i$, PTree allows us to output the $occ_i$ patterns occurring at
position $i$ in $O(\log m + occ_i)$ time. Hence, the pattern matching problem reduces
to computing the max-prefix for every position. 

In the pattern matching phase, our algorithm processes $T$ in a streaming
fashion, i.e., each character is processed in increasing order and discarded
before taking the next character. Just before processing $T[j+1]$, the algorithm
maintains a pair of a signature $p$ and an integer $l$ such that $val(p)[1..l]$ is the longest
suffix of $T[1..j]$ that is also a prefix of one of the patterns. When $T[j+1]$ comes,
we search for the smallest position $i \in \{j-l+1, \ldots, j+1\}$ such that there
is a pattern whose prefix is $T[i..j+1]$. For each $i \in \{j-l+1, \ldots, j+1\}$ in
increasing order, we check if there exists a pattern whose prefix is $T[i..j+1]$ by
binary search on a sorted list of the $m$ patterns. Since $T[i..j] = val(p)[i-j+l..l]$,
LCE with $p$ can be used for comparing a pattern prefix and $T[i..j+1]$ (except
for the last character $T[j+1]$), and hence, the binary search is conducted in
$O(\log m \log N \log^* M)$ time. For each $i$, if there is no pattern whose prefix is
$T[i..j+1]$, we have actually computed the max-prefix for $i$, and then we output
the occurrences of patterns at $i$. The time complexity is dominated by the binary
search, which takes place $O(|T|)$ times in total. Therefore, the algorithm runs in
$O(|T| \log m \log N \log^* M + occ)$ time.
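
The streaming search described above can be sketched with plain strings. In this illustration, which is an assumption-based stand-in rather than the algorithm's actual implementation, direct substring comparison replaces the LCE queries with $p$, Python's `bisect_left` performs the binary search on the sorted pattern list, positions are 0-based, and the function names are ours.

```python
from bisect import bisect_left


def some_pattern_has_prefix(sorted_patterns, w):
    """True iff some pattern starts with w (one binary search)."""
    pos = bisect_left(sorted_patterns, w)
    # Patterns with prefix w form a contiguous block starting at the first
    # pattern that is lexicographically >= w.
    return pos < len(sorted_patterns) and sorted_patterns[pos].startswith(w)


def longest_suffix_prefix_lengths(T, patterns):
    """For each j, the length of the longest suffix of T[:j+1] that is a
    prefix of some pattern, maintained while reading T left to right."""
    pats = sorted(patterns)
    out, length = [], 0
    for j in range(len(T)):
        length += 1                       # try to extend the previous match
        while length > 0 and not some_pattern_has_prefix(pats, T[j - length + 1:j + 1]):
            length -= 1                   # move the start position i to the right
        out.append(length)
    return out
```

Each failed check moves the start position one step to the right and it never moves back, which is why the number of binary searches is linear in $|T|$.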

Incidentally, one might want to know occurrences of patterns as soon as
they appear, as Aho-Corasick automata do by reporting the occurrences of
the patterns by their ending positions. Our algorithm described above can be
modified to support this without changing the time and space complexities. In the
preprocessing phase, we additionally compute RPTree (reversed pattern tree),
which is analogous to PTree but defined on the reversed strings of the patterns,
i.e., RPTree is the compacted trie of size $O(m)$ that represents the reversed
strings of the $m$ patterns. Let $T[i..j]$ be the longest suffix of $T[1..j]$ that is also
a prefix of one of the patterns. A suffix $T[i'..j]$ of $T[i..j]$ is called the max-suffix
for $j$ iff it is the longest suffix of $T[i..j]$ that is also a suffix of one of the patterns.
Supposing that we know the max-suffix for $j$, RPTree allows us to output the $eocc_j$
patterns occurring with ending position $j$ in $O(\log m + eocc_j)$ time. Given a pair
of a signature $p$ and an integer $l$ such that $T[i..j] = val(p)[1..l]$, the max-suffix for $j$
can be computed in $O(\log m \log N \log^* N)$ time by binary search on a list of the $m$
patterns sorted by their reversed strings, since each comparison can be done
by a leftward LCE with $p$. Except that we compute the max-suffix for every
position and output the patterns ending at each position, everything else is the
same as in the previous algorithm, and hence the time and space complexities are
not changed.